fix(coherence): Debugging to Get Qwen Working In Agentic Coding by tbraun96 · Pull Request #90 · Avarok-Cybersecurity/atlas

tbraun96 · 2026-05-23T13:45:37Z

The bug. Live-repro'd against opencode driving Qwen/Qwen3.6-35B-A3B-FP8 on Atlas with 9 tool schemas, temperature=0.3, thinking_in_tools=true. The model emits literal <tool_call><function=…><parameter=…>…</parameter></function></tool_call> XML inside its <think> block, repeated ~6 times before the thinking budget force-closes </think>. The literal XML reaches opencode's rendered transcript as reasoning_content deltas and visibly contaminates the agentic session. Reproducible across multiple sessions; see /home/nologik/opencode-session.md and /workspace/atlas-dumps/opencode.jsonl.

Root causes (3-agent Explore + 1 Plan):

Thinking-loop watchdog's period range is 4..20 tokens (helpers.rs:70-79); the leaked XML is ~50 tokens per occurrence, so its period falls outside the detector's range.
Single-token tool_call_start_token hard-mask at decode_logits_seq.rs:137-143 is a no-op when <tool_call> is multi-token.
Grammar bitmask is disabled while thinking (decode_logits_seq.rs:261-273).
SimHash, loop-watchdog, and the F11/F44/F5/Bug-2 caps are all gated off during thinking.
Post-hoc reasoning stripper at handle_token.rs:91-112 is per-delta and can't stop the model from generating the next repetition.
Sampling regime: opencode forces temp=0.3, top_p=1 (vs preset 0.6/0.95); DRY disabled; 9 tool schemas in context apply prior pressure.
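
The first root cause (a period range too narrow for the leak) can be illustrated with a minimal sketch of an end-anchored period detector. This is not the code in helpers.rs; `detect_tail_period`, its parameters, and the `min_repeats` value used below are assumptions made for illustration only.

```rust
/// Hypothetical end-anchored period detector (NOT the helpers.rs implementation).
/// Returns the smallest period in `min_period..=max_period` for which the last
/// `period * min_repeats` tokens are one block repeated `min_repeats` times.
fn detect_tail_period(
    tokens: &[u32],
    min_period: usize,
    max_period: usize,
    min_repeats: usize,
) -> Option<usize> {
    if min_period == 0 || min_repeats < 2 {
        return None; // a single copy of a block is not a repetition
    }
    for period in min_period..=max_period {
        let window = period * min_repeats;
        if tokens.len() < window {
            continue;
        }
        let tail = &tokens[tokens.len() - window..];
        // Every token in the window must equal the token one period earlier.
        if (period..window).all(|i| tail[i] == tail[i - period]) {
            return Some(period);
        }
    }
    None
}
```

With a 50-token block repeated six times, a 4..=20 range sees nothing while a wider range reports period 50, which is the gap the deferred Layer B is meant to close.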

MTP is not a cause (drafts are greedy argmax; F62 SSM rollback safe; leak repros pre-MTP-enable).

Fix — two layers

Layer D (hotfix, 1 line)

kernels/gb10/qwen3.6-35b-a3b/MODEL.toml: thinking_in_tools = true → false. With thinking off during tool-active turns the <think> block never opens. Re-enable once Layer A is validated end-to-end.

Layer A (real fix, ~75 LoC in 2 files)

Cross-delta scanner in the chat_stream thinking branch. On match: drop the delta, set tool_loop_capped, set stop_string_triggered, set cancel_flag — scheduler finalises within one decode step (PR #89), handle_done emits finish_reason="length" via the PR #87 override.

Files:

crates/spark-server/src/api/chat_stream/state.rs — reasoning_xml_scan_buf (rolling 256-char tail) + reasoning_xml_leak_detected one-shot.
crates/spark-server/src/api/chat_stream/handle_token.rs — scanner + short-circuit guard in the thinking phase only.

Triggers: <tool_call>, <function=, <parameter=, <invoke (Qwen3-Coder, Hermes, MiniMax-XML variants). Gated on ctx.tool_defs_for_backfill non-empty so a benign chat that mentions <tool_call> does not trip.
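
A minimal sketch of the cross-delta scan described above, assuming a simplified state struct. `ReasoningScan` and `push` are hypothetical names; the real fields live on `StreamState`, and the cancel/finish plumbing is omitted here.

```rust
/// Openers named in the PR description.
const TRIGGERS: [&str; 4] = ["<tool_call>", "<function=", "<parameter=", "<invoke"];
/// Rolling tail size in characters, as stated in the PR description.
const MAX_TAIL_CHARS: usize = 256;

/// Hypothetical stand-in for the two new `StreamState` fields.
#[derive(Default)]
struct ReasoningScan {
    tail: String,        // plays the role of reasoning_xml_scan_buf
    leak_detected: bool, // plays the role of reasoning_xml_leak_detected
}

impl ReasoningScan {
    /// Feeds one reasoning delta. Returns true when the delta must be dropped.
    fn push(&mut self, delta: &str, tools_active: bool) -> bool {
        if self.leak_detected {
            return true; // one-shot: everything after the first hit is dropped
        }
        if !tools_active {
            return false; // benign chat may mention the literal tags
        }
        self.tail.push_str(delta);
        if TRIGGERS.iter().any(|t| self.tail.contains(t)) {
            self.leak_detected = true;
            return true;
        }
        // Keep only the last MAX_TAIL_CHARS characters, cutting on a char boundary.
        let n = self.tail.chars().count();
        if n > MAX_TAIL_CHARS {
            let cut = self
                .tail
                .char_indices()
                .nth(n - MAX_TAIL_CHARS)
                .map(|(i, _)| i)
                .unwrap_or(0);
            self.tail.drain(..cut);
        }
        false
    }
}
```

Because the buffer spans deltas, an opener split across two deltas (`<tool` followed by `_call>`) still matches, which a per-delta stripper cannot see.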

Layer B (deferred)

Watchdog period extension — needs codegen plumbing through 6 files for new MODEL.toml [behavior] keys. Tracked separately; Layer A alone closes the leak.

Verification

Local: cargo check, cargo clippy --tests, cargo fmt --check, cargo test -p spark-server (484 passed). handle_token.rs 476 LoC (≤500 cap).
Live replay: image rebuilt locally, deployed on dgx1, replay of jq 'select(.seq==1)' /workspace/atlas-dumps/opencode.jsonl. Expect: no <tool_call> / <function= / <parameter= substrings in reasoning_content; terminal SSE finish_reason: "length"; the new tracing::warn! line appears in server logs.

Docker Hub push held until user re-runs an opencode session and confirms the leak is gone.

Stacked on top of PR #89 (cancel_flag plumbing) — base branch is fix/scheduler-cancel-flag's landing point. Once #89 merges to main this will rebase cleanly.

…ps the response

When the Bug-2 name-run cap (or F11 within-dedup / F5 cross-flush dedup / F44 perm-fail circuit-breaker) forcibly ends a streaming response, `finish_reason` was previously `"tool_calls"`, because tool calls *were* emitted, just truncated mid-loop. Agent clients (opencode and friends) see a normal-looking tool-call completion, dutifully run the tools, send the next request, and the model loops again. Atlas was breaking the loop one round at a time without ever telling the client.

Add a `tool_loop_capped: bool` on `StreamState`, flipped true alongside `stop_string_triggered` at every tool-call loop guard (4 sites in `tool_handlers.rs`). `handle_done` reads it and overrides `fr` to `"length"` (OpenAI's spec slot for "response was forcibly truncated") ahead of the existing `"tool_calls"` / `finish_reason` fall-throughs. This gives every agent client a clean, spec-compliant hook to break its outer retry loop without needing Atlas-specific headers. Also dumped to the `--dump` synthesized-response body for observability.

Verified: `cargo check`, `cargo clippy --tests`, `cargo fmt --check` all clean. Live repro will follow once the image is rebuilt.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
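
The override order described in this commit can be sketched as follows. The struct is reduced to two flags and `resolve_finish_reason` is a hypothetical name; the real logic lives in `handle_done`.

```rust
/// Reduced, hypothetical view of the stream state.
struct StreamState {
    tool_loop_capped: bool,
    emitted_tool_calls: bool,
}

/// Picks the finish_reason: a tripped loop guard wins over "tool_calls",
/// which in turn wins over whatever the scheduler reported.
fn resolve_finish_reason<'a>(state: &StreamState, natural: &'a str) -> &'a str {
    if state.tool_loop_capped {
        "length"
    } else if state.emitted_tool_calls {
        "tool_calls"
    } else {
        natural
    }
}
```

The ordering matters: checking `tool_loop_capped` first is what keeps a truncated loop from looking like a normal tool-call completion.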

…ppress output

The PR #87 fix changed finish_reason to "length" when a tool-loop guard trips, so agent clients can break their outer retry loop, but only when the scheduler actually finalises and emits Done. Live repro on opencode revealed the deeper bug: setting `stop_string_triggered = true` in chat_stream only suppresses *output*; the scheduler keeps generating tokens until natural EOS or `max_tokens`. On a degenerate-loop response (model not EOS-ing), this manifests as a hang: the stream silently consumes tokens, the channel can fill, the scheduler can block on `blocking_send`, GPU goes 0%, no Done event ever fires, opencode sits forever waiting on the SSE stream.

Add a cooperative cancellation flag plumbed from chat_stream into the scheduler:

    Arc<AtomicBool> cancel_flag
    │
    ├── created in chat_stream/mod.rs
    ├── passed into InferenceRequest::Streaming { cancel_flag, .. }
    ├── stashed on StreamState (cancel_flag); chat_stream flips true on:
    │     • Bug-2 name-run cap trip (handle_complete_tool_call,
    │       handle_tool_call_end)
    │     • F11 within-response dedup
    │     • F44 perm-fail circuit-breaker
    │     • cross-flush tool_arg_dedup trip
    │     • loop-watchdog fire (SimHash + substring repeat)
    └── carried through PrefillInProgress → ActiveSeq on the scheduler side;
        `emit_step::emit_token` reads it at the top of every token-emit and
        sets `a.finished = true` if flipped (equivalent to an EOS), so the
        existing finalize path runs and `handle_done` emits the proper
        `tool_loop_capped` / `finish_reason="length"` chunks + `[DONE]`.

Spill-restored ActiveSeq carries `cancel_flag: None`: the original streaming connection is long gone by the time a swapped-out seq resumes from disk. /v1/completions also passes a fresh never-flipped flag so the scheduler's type-check is satisfied; the guard pipeline doesn't run on that legacy path yet.

Verified: `cargo check`, `cargo clippy --tests`, `cargo fmt --check`, `cargo test -p spark-server` (484 passed), `cargo build --release` all clean.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
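
A minimal sketch of the cooperative cancel, assuming a heavily reduced `ActiveSeq`. Whether the real `emit_token` still records the token on the step where it notices the flag is not stated in the commit; skipping it here is an assumption of this sketch.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Reduced, hypothetical view of the scheduler-side sequence.
struct ActiveSeq {
    /// `None` models a spill-restored sequence with no live stream.
    cancel_flag: Option<Arc<AtomicBool>>,
    finished: bool,
    output_tokens: Vec<u32>,
}

/// Checks the flag at the top of every emit, then records the token.
fn emit_token(a: &mut ActiveSeq, tok: u32) {
    let cancelled = a
        .cancel_flag
        .as_ref()
        .is_some_and(|f| f.load(Ordering::Acquire));
    if cancelled {
        a.finished = true; // treated like EOS so the normal finalize path runs
        return;
    }
    a.output_tokens.push(tok);
}
```

The stream side holds a clone of the same `Arc` and calls `store(true, Ordering::Release)` when a guard trips; the scheduler observes it on its next emit without any channel round trip.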

…+ opencode)

Layer-D hotfix + Layer-A primary fix (Layer-B watchdog-period extension deferred to a follow-up: the codegen plumbing for new MODEL.toml [behavior] keys spans 6 files and Layer-A on its own closes the leak end-to-end).

**The bug.** With opencode (9 tool schemas, temp=0.3, 16k+ system prompt) Qwen3.6-35B-A3B emits literal `<tool_call><function=…><parameter=…>…</parameter></function></tool_call>` XML INSIDE its `<think>` block, repeated 6+ times before the thinking budget force-closes `</think>`. The leak reaches opencode's rendered transcript as `reasoning_content` deltas and visibly contaminates the agentic session. Three-agent investigation pinpointed six root causes; this patch closes the two biggest gaps.

**Layer D — `kernels/gb10/qwen3.6-35b-a3b/MODEL.toml`** (hotfix): `thinking_in_tools = true → false`. With thinking off during tool-active turns, the `<think>` block never opens and the leak cannot occur. Re-enable once Layer A is validated end-to-end via opencode.

**Layer A — stream-side scanner with hard cancel** (the real fix):

- `crates/spark-server/src/api/chat_stream/state.rs`: two new fields on StreamState: `reasoning_xml_scan_buf: String` (≤256-char rolling tail of decoded reasoning text across deltas) and `reasoning_xml_leak_detected: bool` (one-shot).
- `crates/spark-server/src/api/chat_stream/handle_token.rs` (thinking branch only): after the existing per-delta strippers, scan the cross-delta buffer for `<tool_call>` / `<function=` / `<parameter=` / `<invoke ` openers (covers Qwen3-Coder, Hermes, MiniMax-XML variants). On first match:
  1. Drop the current delta (no reasoning_chunk emit).
  2. Set `state.tool_loop_capped = true`. PR #87's override in `handle_done.rs:151` then maps `finish_reason` to `"length"`, which agent clients (opencode, Claude Code, …) treat as "forcibly truncated" and break their outer retry loop.
  3. Set `state.stop_string_triggered = true` for the suppression path.
  4. Set `state.cancel_flag` (PR #89). The scheduler reads it at the top of `emit_step::emit_token` and finalises the sequence within one decode step.
  5. `tracing::warn!` with model + request id + last 64 chars of the scan buffer for postmortem.

Subsequent thinking-phase tokens for the stream short-circuit at the top of the branch with empty SSE output (handles the in-flight race between cancel set and emit_token noticing).

Scanner is gated on `ctx.tool_defs_for_backfill` being non-empty (i.e. tools_active) so a benign chat that genuinely discusses the `<tool_call>` literal does not trip it.

**Verified.** `cargo check`, `cargo clippy --tests`, `cargo fmt --check`, `cargo test -p spark-server` (484 passed) all clean. `handle_token.rs` 476 LoC, under the 500 cap. Live replay verification follows image rebuild against the running dgx1 instance.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Checkpoint of work-in-progress before the Phase A injection-removal refactor (see bench/qwen36_fp8_dequant_audit/prompt_injection_audit.md and post_processor_comparison_atlas_vs_vllm.md).

Phase 2 numerical-drift fixes (Qwen3.6-35B-A3B-FP8):
- RNE rounding in f32_to_bf16 (atlas-quant/src/fp8.rs + spark-model/src/weight_map/fp8_lut.rs)
- __expf softmax in prefill_paged_compute.cuh (polynomial path retained behind ATLAS_FAST_SOFTMAX_EXP)
- FP16 P x V MMA (10-bit mantissa vs BF16's 7-bit) in prefill_paged_compute*.cuh

Watchdog rewrites (scheduler/{helpers,decode_logits_*}.rs):
- vLLM-anchored content-loop detector (has_repeating_pattern_anchored)
- disable_watchdogs() env-var kill switch
- mid-word </think> defer via mid_word_token_mask
- describe_content_token_loop diagnostic re-scan on fire

Build infra (atlas-kernels/build.rs + build_target.rs):
- Dedup + std::thread::scope parallel NVCC compile; 7min -> 75sec (242 unique invocations vs 1169 raw, 4.8x dedup, 20 workers)

MODEL.toml sampler band-aid (all 11 GB10 model dirs):
- Empirical defaults (temp=0.6, pres=0.0, rep_pen=1.1, dry=0.5) that produced clean opencode sessions. Documented Qwen-card "correction" regression to anchor future tuning.
- max_inter_tool_prose raised 384 -> 2048 (was killing legitimate multi-paragraph explanations between tool calls).

Audit artifacts (bench/qwen36_fp8_dequant_audit/):
- prompt_injection_audit.md cataloging 13 always-on prompt mutations
- post_processor_comparison_atlas_vs_vllm.md identifying 5 vLLM wins
- per_model_sampler_recommendations.md research baseline

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
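
For readers unfamiliar with the RNE change listed in this commit, the following is a minimal sketch of round-to-nearest-even versus truncation when narrowing f32 to bf16. It uses the standard bit-level technique and is not copied from atlas-quant; the function names and the NaN handling are assumptions.

```rust
/// Truncation: keep the top 16 bits of the f32 pattern.
fn f32_to_bf16_trunc(x: f32) -> u16 {
    (x.to_bits() >> 16) as u16
}

/// Round-to-nearest-even: add 0x7FFF plus the lowest kept bit, then shift.
/// Ties (discarded half exactly 0x8000) round up only when the kept LSB is 1.
fn f32_to_bf16_rne(x: f32) -> u16 {
    let bits = x.to_bits();
    if x.is_nan() {
        // Keep it a NaN: force a mantissa bit so the result cannot become infinity.
        return ((bits >> 16) as u16) | 0x0040;
    }
    let lsb = (bits >> 16) & 1;
    (bits.wrapping_add(0x7FFF + lsb) >> 16) as u16
}
```

Truncation always biases toward zero; RNE removes that bias, which is consistent with the later bisect where disabling RNE lowered the mean cosine.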

Atlas previously ran a stateful agentic-failure-handling layer that mutated the user-supplied message array in 13 distinct ways before tokenization, none opt-in per request, all keyed on heuristics over message history (audit in bench/qwen36_fp8_dequant_audit/prompt_injection_audit.md). Phase 2 precision fixes (RNE, __expf, FP16 P×V, vLLM-anchored loop detection) make these obsolete and they now train the model to react to XML tags it wasn't trained on.

Deleted injection sites:

1. Tool-parser system_prompt prepend (api/chat/mod.rs:106-118). The jinja template's `tools` argument is the legitimate scaffolding.

2-9. F-feature failure guards (chat_phases::apply_failure_guards and the entire api/failures/ subtree): F7 stall reminder, F23 progress reminder, F29 environment_facts, F31 hard-refusal, F32 dup tool_result, F35 failure_recovery, F39 circuit banner, F49 dup-write banner, F50 original-error append

10. Loop-detector <IMPORTANT> hint append (loop_detect.rs:131-146)

11. task_pin verbatim-goal reminder ("The user's ORIGINAL request was: «...»"): task_pin.rs deleted, caller in loop_detect.rs

12. observation_mask body rewrite (rewrote past tool/user error bodies in place with "[stale tool failure N/M: ...]"): observation_mask.rs deleted, caller in chat/mod.rs

13. Responses-API instructions stacking (responses_lowering.rs): simplified to spec-faithful insert at pos 0 (no drop-of-prior)

14. /v1/completions `<think></think>\n\n` raw-prompt prefix (completions.rs:74-78)

Preserved legitimate streaming guards (NOT prompt injections; operate purely on outbound model text) by extracting them from the deleted failures subtree into a new api/stream_guards.rs module:
- bump_f12_tool_call_count: per-response tool-call cap
- check_loop_watchdog: post-detector repeating-line detector
- flush_content_sanitizer: stream-end tag-tail drain

Telemetry counters removed alongside their injectors:
- atlas_task_pin_injections_total
- atlas_observation_mask_elided_bodies_total

Net: -3480 LoC (40 files), zero injection sites remaining (verified zero matches for atlas_runtime_notice / failure_recovery / environment_facts / IMPORTANT> / stale tool failure / ORIGINAL request was). cargo check + cargo check --tests both green.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>

…ctionParams

Two small per-request controls ported from vLLM. Phase B-2 (thinking_token_budget) was already present on ChatCompletionRequest (top-level + chat_template_kwargs paths), so only B-1 and B-3 needed work.

B-1: stop-string hold-back buffer (vLLM IncrementalDetokenizer pattern)
- StreamCtx.stop_string_buffer_len cached once at request setup as max(stop_strings.len()) - 1
- StreamState.stop_string_emitted_len tracks forwarded bytes
- handle_token::apply_stop_string_holdback() helper:
  - On miss: hold back trailing buffer_len bytes (snapped to UTF-8 char boundary via floor_char_boundary)
  - On match: truncate BOTH accumulator and emittable delta at match position (mirrors vLLM detokenizer.py:140-142)
- handle_done flushes any held-back tail bytes through the active detector/sanitizer at stream close
- 3 unit tests: chunk-boundary leak prevention, no-stop-strings zero behavior change, UTF-8 boundary safety
- Hand-tuned <think>-tag boundary deferral untouched

B-3: per-request RepetitionDetectionParams
- New struct {min_pattern_size, max_pattern_size, min_count} on ChatCompletionRequest + CompletionRequest + InferenceRequest
- Threaded into ActiveSeq, PrefillInProgress, SwappedSeq (survives swap/restore)
- helpers.rs: new detect_*_token_loop_with(tokens, Option<override>) variants; legacy detect_*_token_loop() delegates to _with(.., None) for byte-identical fallback to watchdog_params() constants
- decode_logits_{step,content}.rs pass a.repetition_detection to the detectors
- Fuzzy + digit-normalized SimHash detectors left untouched
- 3 unit tests for override path (loosen / tighten / thinking-loop)

Test cleanup: detects_fence_body_with_varying_prefixes was written for the OLD scan-anywhere detector and broke when we switched to vLLM-anchored (commit 1bb82ed). Inverted assertion + renamed to rejects_*: anchored detector intentionally does not fire on this varying-prefix pattern; the rollback-to-boundary machinery catches the resulting tighter loop one layer up.

cargo check + cargo check --tests + clippy all green.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
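
The B-1 hold-back can be sketched as below. `Holdback`, `push`, and `floor_boundary` are hypothetical names (the commit's helper is `apply_stop_string_holdback`, whose body is not shown), and the final flush performed by `handle_done` is omitted.

```rust
/// Largest byte index <= `idx` that lies on a UTF-8 char boundary of `s`.
fn floor_boundary(s: &str, idx: usize) -> usize {
    let mut i = idx.min(s.len());
    while !s.is_char_boundary(i) {
        i -= 1;
    }
    i
}

/// Hypothetical hold-back state: accumulator plus how many bytes were forwarded.
struct Holdback {
    acc: String,
    emitted_len: usize,
    buffer_len: usize, // longest stop string length minus one
    stops: Vec<String>,
}

impl Holdback {
    fn new(stops: &[&str]) -> Self {
        let stops: Vec<String> = stops
            .iter()
            .filter(|s| !s.is_empty())
            .map(|s| s.to_string())
            .collect();
        let longest = stops.iter().map(|s| s.len()).max().unwrap_or(0);
        Holdback { acc: String::new(), emitted_len: 0, buffer_len: longest.saturating_sub(1), stops }
    }

    /// Appends a delta and returns (text safe to forward, stop string hit).
    fn push(&mut self, delta: &str) -> (String, bool) {
        self.acc.push_str(delta);
        let unsent = &self.acc[self.emitted_len..];
        // Earliest stop-string match in the not-yet-forwarded region.
        let hit = self.stops.iter().filter_map(|s| unsent.find(s.as_str())).min();
        if let Some(pos) = hit {
            let out = unsent[..pos].to_string();
            self.acc.truncate(self.emitted_len + pos); // truncate accumulator too
            self.emitted_len = self.acc.len();
            return (out, true);
        }
        // Miss: keep the last buffer_len bytes back, snapped to a char boundary.
        let target = self.acc.len().saturating_sub(self.buffer_len);
        let cut = floor_boundary(&self.acc, target).max(self.emitted_len);
        let out = self.acc[self.emitted_len..cut].to_string();
        self.emitted_len = cut;
        (out, false)
    }
}
```

Holding back one byte less than the longest stop string is enough: any partial match that could still complete fits entirely inside the withheld tail, so no prefix of a stop string is ever forwarded.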

Live opencode session (opencode-phaseAB.jsonl, 2026-05-24) on the post-PhaseB image surfaced two distinct user-visible failures:

1. **Content-loop watchdog firing on legitimate JSON tool-call body.** `CONTENT_LOOP_MIN_REPEATS=2` + `PERIOD_MIN=2` meant the watchdog fired on FOUR matching end-anchored tokens. JSON structural punctuation (`","`/`":"`) naturally forms period-2 patterns inside tool-call bodies: observed 18:13:18 firing at content_tokens=48 inside a bash() invocation, ending the response with `reason=NoBoundary` (rollback declined because mid-grammar). The tool-salvage path recovered a garbled `bash({"command":""})`.

   Fix: `CONTENT_LOOP_MIN_REPEATS` 2 → 3. Matches vLLM's `RepetitionDetectionParams.min_count` default. Period-2 with 3 repeats now requires 6 matching end-anchored tokens, which still catches genuine `[A, B]` attractors (~100ms after onset) but tolerates the `","`/`":"` JSON-punctuation band. Per-request `repetition_detection` (Phase B-3) still wins over this default, so clients with stricter or looser needs can override.

2. **2048-token thinking budget burned on agentic turns.** Two failures traced to `max_thinking_budget=2048` (qwen3.6 MODEL.toml):
   - seq=1 (title-gen, 0 tools): 2068 thinking tokens for a 7-token title.
   - seq=8 (agentic turn 9): full 2048 budget exhausted, model force-closed via `</think>` injection, then emitted EOS with zero content tokens. 33s of stall.

   Fix: `qwen3.6-35b-a3b/MODEL.toml.max_thinking_budget` 2048 → 768. Bounds worst-case thinking burn at ~30s. Typical agentic thinking is 200-500 tokens; 768 leaves headroom. Clients can pass `thinking_token_budget` per-request (Phase B-2) for chat workflows that genuinely need more.

Tests updated: `content_loop_accepts_two_repeats` → `accepts_three_repeats` + added negative `rejects_two_repeats`; `override_loosens_content_loop_threshold` updated for new default (3-repeat fires, 4-repeat doesn't).

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
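
The effect of raising the repeat threshold can be checked with a small fixed-period sketch. `tail_repeats` is a hypothetical helper, not the anchored detector in helpers.rs, and the token IDs are made up.

```rust
/// True when the last `period * min_repeats` tokens are one block of length
/// `period` repeated `min_repeats` times (end-anchored, single fixed period).
fn tail_repeats(tokens: &[u32], period: usize, min_repeats: usize) -> bool {
    let window = period * min_repeats;
    if period == 0 || min_repeats < 2 || tokens.len() < window {
        return false;
    }
    let tail = &tokens[tokens.len() - window..];
    (period..window).all(|i| tail[i] == tail[i - period])
}
```

A JSON-like tail such as `[7, 8, 1, 2, 1, 2]` satisfies period 2 with 2 repeats (four matching tokens) but not with 3 repeats (six tokens), while a sustained `[1, 2, 1, 2, 1, 2]` attractor satisfies both.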

Live opencode session on `atlas-gb10:hotfix` (opencode-hotfix.jsonl seq=10, 2026-05-24): after Atlas's tool-validation correctly rejected a `write({})` call with empty filePath, the model degenerated into a doom loop emitting partial `<tool_call>` envelopes that never closed. The streaming sanitizer correctly detected the orphan opener and flipped `suppressing_param_leak=true`, then drained the subsequent content waiting for a close, but no close arrived for 8192 tokens (max_tokens limit). The content-loop watchdog couldn't catch this because partial-envelope periods exceed `CONTENT_LOOP_PERIOD_MAX=64`. User-visible result: 116s of "hung" streaming with finish=length, content="" (sanitizer suppressed all of it).

Fix: add `suppress_streak_tokens: u32` to `StreamState`. Per-token on the chat_stream path, increment when `suppressing_param_leak` is true at end-of-token; reset when it flips false. When the streak exceeds `MAX_SUPPRESS_STREAK_TOKENS=256`, trip the same kill switch the content-loop watchdog uses (`loop_watchdog_triggered=true`, `stop_string_triggered=true`, `cancel_flag.store(true)`).

256 tokens is enough headroom for legitimately long tool-call bodies (large `content` strings on a `write` call) while bounding worst-case wasted decode at ~10s @ 30 tok/s. The 8192-token doom loop now ends in ~10s with finish_reason=length and a tracing warn line carrying the streak count, so future occurrences are self-debuggable.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>

Scaffold the `LogitsProcessor` trait + 8 processor implementations from the inline pre-sample masking block in `decode_logits_seq::process_seq_logits`. **Not yet wired**: this commit only introduces the new module; the inline block in `process_seq_logits` still runs the same logic. Part 2 will replace the inline block with `run_pipeline(...)`.

Module layout (`crates/spark-server/src/scheduler/logit_processors/`):

- `mod.rs`: `LogitsContext`, `ProcessorOutcome` enum, `LogitsProcessor` trait with `apply(&mut [f32], &mut ActiveSeq, &LogitsContext) -> ProcessorOutcome`, `name()`, `is_argmax_invariant()`, and the `run_pipeline(...)` driver wiring the eight stages in this order:
  1. F2ConfidenceEarlyStop
  2. MidWordThinkEndMask
  3. PostCloseThinkMask
  4. ToolCallDuringThinkingMask
  5. ForcedThinkEndInjector
  6. PinToToolCallStart
  7. ForcedTokenFastPath ← may short-circuit (EmitToken)
  8. GrammarBitmaskApply
- One file per processor (~40-80 LoC each) preserving byte-identical gate semantics from the monolith: f2_confidence.rs / mid_word.rs / post_close.rs / tool_during_think.rs / forced_think_end.rs / pin_tool_call.rs / forced_token.rs / grammar_bitmask.rs
- `pipeline_tests.rs` (cfg=test): stage-name stability + distinctness, `is_argmax_invariant` advertisement matrix, `should_inject_think_end` truth table, defer-override constants pin. Integration replay against the wired pipeline lives in opencode-session.md once part 2 lands (ActiveSeq is too complex to fixture cleanly).

Out of scope (intentional): AdaptiveSamplingState entropy observation runs after this pipeline; it's a sampling-policy decision, not a logit transform. The final `sample_with_params_history` call is also downstream. Both stay in `process_seq_logits` after the wire.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
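
The pipeline shape described in this commit can be sketched in reduced form. This is an illustration under assumptions, not the module's code: the real `apply` also takes `&mut ActiveSeq`, `name()` and `is_argmax_invariant()` are omitted, and the two toy processors are hypothetical stand-ins for the eight real stages.

```rust
/// What a stage tells the driver after touching the logits.
enum ProcessorOutcome {
    Continue,
    EmitToken(u32), // short-circuit: skip remaining stages and sampling
}

/// Reduced context; the real one carries several special-token ids.
struct LogitsContext {
    think_end_token: u32,
}

/// Reduced trait (the real signature also receives the sequence state).
trait LogitsProcessor {
    fn apply(&self, logits: &mut [f32], ctx: &LogitsContext) -> ProcessorOutcome;
}

/// Toy mask stage: forbids the think-end token.
struct MaskThinkEnd;
impl LogitsProcessor for MaskThinkEnd {
    fn apply(&self, logits: &mut [f32], ctx: &LogitsContext) -> ProcessorOutcome {
        if let Some(slot) = logits.get_mut(ctx.think_end_token as usize) {
            *slot = f32::NEG_INFINITY;
        }
        ProcessorOutcome::Continue
    }
}

/// Toy fast path: emits a forced token when one is pending.
struct ForcedToken(Option<u32>);
impl LogitsProcessor for ForcedToken {
    fn apply(&self, _logits: &mut [f32], _ctx: &LogitsContext) -> ProcessorOutcome {
        match self.0 {
            Some(tok) => ProcessorOutcome::EmitToken(tok),
            None => ProcessorOutcome::Continue,
        }
    }
}

/// Runs stages in order; returns Some(token) if a stage short-circuited.
fn run_pipeline(
    stages: &[Box<dyn LogitsProcessor>],
    logits: &mut [f32],
    ctx: &LogitsContext,
) -> Option<u32> {
    for stage in stages {
        if let ProcessorOutcome::EmitToken(tok) = stage.apply(logits, ctx) {
            return Some(tok);
        }
    }
    None
}

/// Greedy pick over the (possibly masked) logits.
fn argmax(logits: &[f32]) -> Option<usize> {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
}
```

Keeping the stage order in one driver is what lets a later commit reuse the same masks on the MTP verify paths instead of duplicating the inline block.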

… fall-through

Live opencode-hotfix2.jsonl seq=4 (2026-05-24 19:00→19:02): the hotfix-2 streak watchdog DID NOT fire on an 8215-token doom loop (content="", finish=tool_calls, max_tokens=8192 effectively burned).

Root cause: `handle_token` has 8+ early-return branches; the streak check was at the natural fall-through, so most token paths skipped it entirely.

Fix: split `handle_token` into a thin wrapper + `handle_token_inner` body. The wrapper runs the streak update + kill check AFTER the inner body returns, so every token path increments/resets the counter and trips the kill switch when warranted. Body now has the same early-return shape; only the watchdog block moved.

The 256-token streak bound was correct; just wasn't reachable.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
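
A minimal sketch of the wrapper-plus-inner split, assuming a toy inner body with a few early returns. The field names follow the commit messages; the branch conditions inside `handle_token_inner` are invented for illustration, and `cancel_flag` is left out.

```rust
const MAX_SUPPRESS_STREAK_TOKENS: u32 = 256;

#[derive(Default)]
struct StreamState {
    suppressing_param_leak: bool,
    suppress_streak_tokens: u32,
    loop_watchdog_triggered: bool,
    stop_string_triggered: bool,
}

/// Toy body with early returns (the real one has eight or more).
fn handle_token_inner(state: &mut StreamState, text: &str) -> Option<String> {
    if text.is_empty() {
        return None;
    }
    if text.starts_with("<tool_call>") {
        state.suppressing_param_leak = true; // orphan opener: start draining
        return None;
    }
    if text.starts_with("</tool_call>") {
        state.suppressing_param_leak = false;
        return None;
    }
    if state.suppressing_param_leak {
        return None; // drained silently
    }
    Some(text.to_string())
}

/// Thin wrapper: the streak bookkeeping runs after every inner return path.
fn handle_token(state: &mut StreamState, text: &str) -> Option<String> {
    let out = handle_token_inner(state, text);
    if state.suppressing_param_leak {
        state.suppress_streak_tokens += 1;
        if state.suppress_streak_tokens > MAX_SUPPRESS_STREAK_TOKENS {
            state.loop_watchdog_triggered = true;
            state.stop_string_triggered = true;
        }
    } else {
        state.suppress_streak_tokens = 0;
    }
    out
}
```

Placing the check in the wrapper means a new early return added to the inner body later cannot silently bypass the watchdog again.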

ROOT CAUSE found 2026-05-24 (opencode-hotfix2b.jsonl seq=13, 8222 tokens, finish=length, content="", reasoning=29):

`process_decode_logits` is the non-MTP decode path. It calls `handle_content_token` which contains the content-loop watchdog (period-2…64 detector, MIN_REPEATS=3). All the prior watchdog tuning applies HERE.

`verify_k2_step` / `verify_k3_step` / `mtp_step` / `spec_step` are the MTP / speculative-decode paths. They call `emit_token` directly, which only bookkeeps the emitted token (push to output_tokens, advance grammar, EOS check); it does NOT run any watchdog.

When `--speculative` is on (which is on by default for this build since `--speculative --mtp-quantization bf16` are in the docker command), essentially every token goes through the MTP path. The content-loop watchdog never fires on MTP-emitted tokens. The recently-tuned MIN_REPEATS=3, the per-request RepetitionDetectionParams plumbing, and the rollback machinery are all unreachable for the common case.

Observed live: model degenerated into a period-4 `[29, 198, 510, 15704, …]` attractor (the `parameter>\n` doom-loop from the project's prior memory note). 8193 content tokens of pure repeat, zero watchdog fires, response burned to max_tokens=8192.

Fix: mirror the content-loop watchdog check inside `emit_token` for the `!inside_thinking` branch. Increments `a.content_tokens`, re-uses the same detector functions and gate constants (`disable_watchdogs`, `enable_loop_watchdog`, `CONTENT_LOOP_*`, `detect_content_token_loop_with`, normalized variant). On fire: set `a.finished = true` and warn.

Skip rollback in this mirror: `emit_token` doesn't take `&dyn Model` (the SSM rewind requires it) and plumbing it through the 14+ MTP/spec call sites would be a separate diff. End-of-stream is a strict improvement over the current 8192-token burn; rollback can be added later if needed. The non-MTP path retains rollback via `handle_content_token` unchanged.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>

…aths

Closes the root-cause loop the team-debug investigation identified (see bench/hotfix3-debug/SYNTHESIS.md): MTP K=2/K=3/K=4/dflash verify paths were returning raw GPU argmax with NO pre-sample logit processors applied. Every MTP-emitted token escaped grammar_bitmask, mid_word_mask, post_close_mask, tool_during_think_mask, forced_think_end_injector, pin_to_tool_call, forced_token_fastpath, and f2_confidence_early_stop, causing grammar desync, malformed tool calls (empty filePath, invented 'description' tool names), mid-word </think> cuts, stray <think> re-entry, and the prompt_tok>13k cliff documented in opencode-hotfix3.jsonl.

## Non-MTP path (P1a)

`decode_logits_seq::process_seq_logits`: 435 → 185 LoC. Replaced the inline 200-line 8-stage masking block with a single `run_pipeline(&mut f32_logits, a, &ctx)` call. `LogitsContext` carries think_start/end and tool_call_start/end tokens; pipeline driver short-circuits on ForcedTokenFastPath::EmitToken. Dequant block, adaptive-sampling/sample_with_params_history, and extract_logprobs_from_f32 unchanged.

Function-signature changes: `_think_start_token` and `_tool_call_end_token` lost the underscore prefix because pipeline processors consume them. Sole caller (`decode_logits_step.rs:82`) already passes by name; no other call-site changes.

## MTP verify paths (P1b)

New helper: `crates/spark-server/src/scheduler/verify_pipeline_helper.rs` (148 LoC):
- `verify_pick_with_pipeline(logits_bytes, ...)`: dequant + run_pipeline + argmax for ONE verify position (~0.4ms/position @ 256k vocab)
- `verify_pick_all_with_pipeline(model, argmax_ids, ...)`: copy [K, vocab] BF16 verify logits D2H, process all K positions in sequence; falls back to raw argmax on D2H failure

Wired into:
- `verify_k2_step` (k=2 mainline)
- `verify_k3_step` (k=3)
- `verify_k4_step` (k=4)
- `verify_dflash_step` (dflash spec-decode)
- `spec_step::step_self_spec` (self-speculation)
- `spec_step::step_ngram_verify` (ngram k=2)
- `mtp_step` (Phase A bootstrap + Phase B verify dispatch)

LogitsContext is built once per scheduler tick in `scheduler/mod.rs` and threaded through `step_mtp`/`step_self_spec`/`step_ngram`.

Drafts are left alone: they already go through grammar bitmask inside `run_mtp_propose_multi` per `mtp_head/forward.rs:382-463`.

GPU verify still bakes argmax inside the CUDA graph; the new path adds a one-shot D2H copy (~K × 0.4ms) which is NOT graph-captured (constraint from the helper design). Reuses `model.copy_logits_to_host()` infrastructure rather than touching `verify_b.rs` / `verify_c.rs`: additive, old argmax path remains for non-pipeline call sites if needed.

## P2: restore `!a.inside_tool_body` gate on content-loop watchdog

Both `handle_content_token` (decode_logits_content.rs) and the hotfix-3 MTP mirror in `emit_step.rs` now skip the watchdog inside the tool body. xgrammar guarantees structural termination there; repeated rejected drafts that look like a period-N attractor were producing false-positive amputations of valid JSON. The `parameter>\n`-style real loops are still caught the instant the model exits the tool body.

## P3: non-silent gs.accept_token failure

`emit_step.rs::emit_token` now captures `gs.accept_token(tok)` return value. On `false` (token violates grammar, i.e. xgrammar NPDA refused), warn + `a.finished = true`. Today's silent desync corrupted all downstream bitmasks for the rest of the response. After P1a+P1b, this should rarely fire; it is defense-in-depth.

## P4: K2/K3/K4 ACCEPT/REJECT log fix

Replaced `seq_len.is_multiple_of(50)` gates (which logged every REJECT but only every 50th ACCEPT, misleading the team-debug investigation into thinking accept rate was 0.32% when actual was 41.5%). Now per-100-step summary `info!` with accepts/rejects/rate via `AtomicU64` counters; per-step events demoted to `debug!`.

## P5: flush `reasoning_tag_scan_buf` on `</think>`

`handle_token.rs` previously dropped up to ~18 trailing bytes of every thinking block on `</think>` fire because the sanitizer's tail buffer was never flushed. Added an explicit flush block (skipped when reasoning-side suppression is active).

## Out of scope

Model-side FP8 KV precision drift at long context ("axut"/"withcurl" spelling errors documented in findings-agent-C-tokenizer.md) is a separate Phase 2c concern. The pipeline fix above does NOT address those; they're emitted by the model, decoded correctly, rendered correctly. See project_qwen36_phase2b_softmax_expf.md.

## Verification

- `cargo check --tests -p spark-server`: clean (0 errors, 0 warnings)
- `cargo clippy --tests -p spark-server`: clean (0 lints)
- `cargo check --workspace --tests`: clean

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>

Live-test on `atlas-gb10:realfix` showed every response ending at 30-60 content tokens with `gs.accept_token returned false — xgrammar NPDA refused the emitted token` (P3 kill-switch) firing on tok=198 (\n) or tok=29 (structural punctuation).

Root cause: in `verify_pick_all_with_pipeline`, the pipeline was applied to each of the K verify positions independently, but the xgrammar matcher state was NEVER ADVANCED between positions, so position 1's bitmask was computed against position 0's matcher state. For K=2 spec decode:

1. Position 0: bitmask correct (matches current matcher state)
2. Position 1: bitmask STALE, matcher hasn't seen pick[0] yet
3. Pipeline argmax for position 1 picks a token valid at position 0's state but possibly invalid at position 1's state
4. emit_token(drafts[0]) → accept_token advances matcher → state now at position 1
5. emit_token(verified[1]) → accept_token(pick[1]): pick[1] is not in this state's bitmask → returns false → P3 ends response with finish_reason=length

Example: token 198 (`\n`) is legal at JSON-value-start (post-`:` in `"key": <value>`) but illegal at JSON-comma-or-closebrace (post-value in `"key": "value"`, where the matcher expects `,` or `}`). The bonus token sampled at position 0's state landed on `\n`, but the real position-1 matcher state demanded `,` or `}`.

Fix: between each verify position, speculatively `gs.accept_token(pick)` to advance the matcher, then `gs.rollback(grammar_advances)` at end of the helper. The real `emit_token` calls (which run after this helper returns) re-advance the matcher normally based on which tokens the scheduler actually accepts/rejects, so the speculative advance must be reverted to avoid double-advancement.

Defensive check on speculative `accept_token` failure: if the pipeline ever returns a token outside the current bitmask (would indicate a real bug elsewhere; the pipeline correctly applies the mask so this shouldn't happen except after grammar termination from a forced-token fastpath), log a `debug!` and stop speculating. The real emit_token will still fire P3 if there's a genuine desync: better visibility, no silent corruption.

Updated module docstring with the new contract + a "DO NOT REMOVE" note pinning the fix against future "simplification" reverts.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
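
The advance-then-rollback contract can be sketched with a toy matcher. `ToyMatcher` and its two-state grammar (token 1 is a value, token 2 is a comma, and they must alternate) are assumptions standing in for the xgrammar matcher; only the `accept_token` / `rollback` shape follows the commit text.

```rust
/// Toy grammar state: 0 expects a value (token 1), 1 expects a comma (token 2).
struct ToyMatcher {
    state: u8,
    history: Vec<u8>, // previous states, one per accepted token
}

impl ToyMatcher {
    fn allowed(&self, tok: u32) -> bool {
        match self.state {
            0 => tok == 1,
            _ => tok == 2,
        }
    }

    /// Advances on a legal token; returns false and leaves state untouched otherwise.
    fn accept_token(&mut self, tok: u32) -> bool {
        if !self.allowed(tok) {
            return false;
        }
        self.history.push(self.state);
        self.state ^= 1;
        true
    }

    /// Undoes the last `n` accepted tokens.
    fn rollback(&mut self, n: usize) {
        for _ in 0..n {
            if let Some(prev) = self.history.pop() {
                self.state = prev;
            }
        }
    }
}

/// Masked argmax for each verify position, advancing the matcher between
/// positions and restoring it before returning.
fn pick_all(matcher: &mut ToyMatcher, logits_per_pos: &[Vec<f32>]) -> Vec<u32> {
    let mut picks = Vec::new();
    let mut advances = 0;
    for logits in logits_per_pos {
        let pick = logits
            .iter()
            .enumerate()
            .filter(|(t, _)| matcher.allowed(*t as u32))
            .max_by(|a, b| a.1.total_cmp(b.1))
            .map(|(t, _)| t as u32);
        let Some(pick) = pick else { break };
        picks.push(pick);
        if matcher.accept_token(pick) {
            advances += 1;
        } else {
            break; // stop speculating; the real emit path will surface the desync
        }
    }
    matcher.rollback(advances); // the real emit path re-advances later
    picks
}
```

With both positions favoring token 1, a stale mask would pick `[1, 1]`; advancing between positions yields `[1, 2]`, and the rollback leaves the matcher exactly where the caller handed it over.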

…nsion eliminated

Ran 18 KV cache configurations against the standard 18920-token probe on Qwen3.6-35B-A3B-FP8, comparing per-layer hidden state cosine vs HF BF16 reference. Configs cover all 6 supported KV dtypes (Fp8, Bf16, Nvfp4, Turbo4, Turbo3, Turbo8) × varying --kv-high-precision-layers × varying --fp8-kv-calibration-tokens.

## Findings (full table in VERDICT-final.md)

1. **8 configs gave bit-identical 0.9605 mean cosine**: bf16-all, bf16-hp-max, fp8-hp{5,10,max}, nvfp4-hp{5,10,max}. The chunked-prefill kernel dequants K/V back to BF16 before attention, so the prefill cosine bench is insensitive to storage dtype. KV cache choice does NOT move the precision needle for this probe.

2. **3 calibration configs gave bit-identical 0.9175**: --fp8-kv-calibration-tokens 64, 256, 512 all produce the same result. Calibration is either no-op for this probe or has a different code path that uniformly degrades.

3. **Turbo8 broken on Qwen3.6 hybrid arch**: produces all-NaN hidden states from L19 onward (first SSM-attention-interleave failure beyond the 4 working full-attention layers). Validated only on MiniMax M2.7's pure-attention layout per kv_cache.rs:132.

4. **Per-layer diff vs May 23 rne baseline**: current fp8-baseline is 0.5-1% worse than the May 23 rne dump at most layers. Confirms the project_qwen36_phase2b_softmax_expf memory note: __expf unmasked a deeper precision bug the polynomial was smoothing.

## Conclusion

KV cache is exonerated. The ~4% gap to HF reference lives in compute, specifically in the post-Phase-2b kernel surface (RNE + __expf + FP16 P×V). Day 2 starts with bisecting which kernel change unmasked the regression, then per-kernel cosine bisection within each layer to localize.

Files preserved:
- bench/phase2c-kv-sweep/runner.sh: config sweep automation
- bench/phase2c-kv-sweep/VERDICT-day1.md: initial draft
- bench/phase2c-kv-sweep/VERDICT-final.md: full 18-config analysis

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
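
The metric behind these numbers is cosine similarity between a layer's hidden state and the reference. A minimal sketch follows, assuming flat `f32` slices per layer; `cosine` is a hypothetical helper and says nothing about how the sweep's runner actually loads or aggregates dumps.

```rust
/// Cosine similarity of two equal-length vectors, accumulated in f64.
/// Returns None for mismatched or empty inputs, zero-norm vectors, or NaN/inf data.
fn cosine(a: &[f32], b: &[f32]) -> Option<f64> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let (mut dot, mut norm_a, mut norm_b) = (0.0_f64, 0.0_f64, 0.0_f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    let denom = norm_a.sqrt() * norm_b.sqrt();
    if denom == 0.0 || !denom.is_finite() || !dot.is_finite() {
        None
    } else {
        Some(dot / denom)
    }
}
```

Returning `None` on non-finite input makes an all-NaN layer (as reported for Turbo8) show up as a missing value rather than silently poisoning a mean.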

Bisect infra (compile + runtime gates for future precision tests):

- ATLAS_EXTRA_NVCC_FLAGS env var threaded into build_target::NvidiaTarget::compile: appends arbitrary nvcc args (typically -D<MACRO>=1) to every kernel compile. Cargo rerun gated via rerun-if-env-changed=ATLAS_EXTRA_NVCC_FLAGS.
- #ifdef ATLAS_DISABLE_FP16_PV in prefill_paged_compute.cuh, covering both the HDIM=256 and HDIM=512 paths. When defined, reverts smem_P/P64 to __nv_bfloat16, P stores to __float2bfloat16_rn, V reads to direct bf16 packing, and the MMA op to .bf16.bf16. Companion bisect knob to the existing ATLAS_FAST_SOFTMAX_EXP gate (which already flips sw_exp polynomial vs __expf).
- ATLAS_DISABLE_RNE=1 runtime env var in atlas-quant::fp8::f32_to_bf16 and spark-model::weight_map::fp8_lut::f32_to_bf16: reverts the Phase 2b RNE patch to plain truncation (the pre-Phase-2b behavior). No rebuild needed once the gated binary is shipped.

Day 2 bisect results (full table in VERDICT-day2.md):

| Config | mean | min | Δ vs baseline |
|---|---|---|---|
| fp8-baseline (current) | 0.9615 | 0.9179 | n/a |
| -DATLAS_FAST_SOFTMAX_EXP=1 | 0.9617 | 0.9184 | +0.0002 |
| -DATLAS_DISABLE_FP16_PV=1 | 0.9619 | 0.9172 | +0.0004 |
| ATLAS_DISABLE_RNE=1 | 0.9569 | 0.8990 | -0.0046 |
| rne (May 23 reference) | 0.9668 | 0.9290 | +0.0053 |

None of the three Phase 2b kernel changes individually accounts for the 0.5% cosine gap vs the May 23 rne reference. RNE is actively helping (disabling it regresses by 0.5%). Both __expf and FP16 P×V are effectively neutral.

The rne reference dump's filesystem timestamp predates any docker image currently on disk by 17+ hours, so its lineage is unclear. It is likely a pre-Phase-2 baseline whose 0.5% gap to current realfix2 reflects accumulated infrastructure changes outside the Phase 2b kernel surface. Direct precision comparison is partly invalid.

Day 3 attack vectors: per-sub-step cosine within each layer, MoE expert-routing divergence audit, NVFP4 weight checkpoint comparison.
Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>

Days 1+2 falsified the FP8 KV cache hypothesis. Day 3 finds the actual answer: the model degeneration is in the FP8 WEIGHT quantization, not Atlas's compute or KV cache.

Same Atlas (atlas-gb10:realfix2), same probe, swap the weight checkpoint from `Qwen/Qwen3.6-35B-A3B-FP8` to `RedHatAI/Qwen3.6-35B-A3B-NVFP4`:

| Aspect | FP8 | NVFP4 |
|---|---|---|
| Throughput (short) | ~65 tok/s | 117 tok/s (1.8×) |
| TTFT | ~140ms | 72ms (½) |
| TOML quotes | single (invalid) | double (valid) |
| Token glue artifacts | "withcurl"/"axut" | none |
| Language switching | Indonesian descriptions | none |
| Code coherence | garbled | valid Rust+axum |
| Deep-context (9780 prompt) tool call | empty filePath, garbage args | "rm -rf … && mkdir -p …" valid shell |
| Watchdog amputations | frequent | none |
| Long-response finish | doom-loop "length" | natural "stop"/"tool_calls" |

Atlas's compute pipeline is correct. NVFP4's per-block FP8 scales × 4-bit E2M1 nibbles give 16× finer per-element precision than dense FP8 E4M3, holding late-layer K/V and MoE weights within reasonable bounds where FP8 hits its E4M3 dynamic-range cliff and quantizes coarsely. Project memory's note "NVFP4 KV best at deep, FP8 best early" was correct, but it applies to WEIGHTS, not KV storage.

Verdict: production deployment of Qwen3.6-35B-A3B should use the NVFP4 checkpoint, not FP8. atlas-gb10:realfix2 works directly with RedHatAI/Qwen3.6-35B-A3B-NVFP4; no rebuild needed. Recommended invocation in bench/phase2c-kv-sweep/VERDICT-day3.md.

Also adds: bench/phase2c-kv-sweep/compare-configs.py, the multi-config cosine comparator used throughout the bisects.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>

Live NVFP4 testing (opencode-nvfp4.jsonl seq=15) hit a doom-loop distinct from the hotfix-2b-handled cases: the model emitted a `<tool_call>` opener but never reached a matching `</tool_call>` close. 8221 tokens of buffered tool-call args, all suppressed by the sanitizer's orphan detection at the final flush, finish=length, content="". The user saw the request as "trapped" while it burned to max_tokens=8192.

The hotfix-2b suppression-streak watchdog couldn't catch this because `suppressing_param_leak` only flipped at the FINAL flush (orphan detection ran at stream end, not during emission), so the streak counter never accumulated.

Fix: new ActiveSeq.tool_body_streak_tokens counter incremented in emit_step::emit_token while a.inside_tool_body=true. Resets on `<tool_call>` open or `</tool_call>` close. When the streak exceeds MAX_TOOL_BODY_TOKENS=1024, set a.finished=true. 1024 is enough headroom for legitimate long tool-call bodies (large `content` field on a `write` call) while bounding worst-case wasted decode at ~15s @ 65 tok/s, vs the ~95s burn at max_tokens=8192 we observed.

Companion to:

- hotfix-2b (suppress_streak_tokens): catches an orphan opener detected DURING the stream
- content-loop watchdog (decode_logits_content + emit_step): catches period-N attractors OUTSIDE the tool body
- max_inter_tool_prose (MODEL.toml): caps free prose BETWEEN tool calls

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
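The streak counter described in this commit can be sketched as follows. This is a simplified model, not the Rust in emit_step: it assumes the open and close tags each arrive as one token string, and the class name is hypothetical. Only the 1024 cap and the reset-on-open/close rule come from the commit message.

```python
MAX_TOOL_BODY_TOKENS = 1024  # cap quoted in the commit message


class ToolBodyWatchdog:
    """Counts tokens emitted inside an unclosed tool-call body."""

    def __init__(self, limit=MAX_TOOL_BODY_TOKENS):
        self.limit = limit
        self.inside_tool_body = False
        self.streak = 0
        self.finished = False

    def on_token(self, text):
        """Feed one decoded token; returns True once the sequence is cut."""
        if text == "<tool_call>":
            # Opener: enter the body and restart the streak.
            self.inside_tool_body = True
            self.streak = 0
        elif text == "</tool_call>":
            # Closer: leave the body and clear the streak.
            self.inside_tool_body = False
            self.streak = 0
        elif self.inside_tool_body:
            self.streak += 1
            if self.streak > self.limit:
                self.finished = True
        return self.finished
```

Because the check runs on every emitted token, the cut happens during emission, which is the property the stream-end orphan detection lacked.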

User-prompted forensic audit of "every logic gate that picks the WRONG quantization for FP8 weights when canonical FP8 paths exist." Key finding: the initial assumption ("all FP8 weights re-quantized to NVFP4") was wrong. Atlas's full-attention QKVO and routed MoE experts DO run on canonical FP8 kernels for Qwen3.6-35B-A3B-FP8 (via `set_fp8_experts` / `set_fp8_weights` in qwen35 loader). The `quantize_to_nvfp4` boot-log lines come from a SMALLER set of weights. 5 bugs identified, ranked by severity: 1. HIGH — SSM decode is always NVFP4 File: weight_loader/qwen35/load_layers/linear_attn_arms.rs:176-190 30 of 40 layers (linear attention) decode through NVFP4-from-FP8 weights. The FP8-native arm `build_linear_attention_fp8` exists in the same file (line 24) but is "permanently short-circuited" per the comment at load_layers.rs:334-342. The kernels and set_fp8_weights plumbing for Qwen3SsmLayer all exist. Likely the actual source of the "deep-layer drift" memory entries (project_qwen36_phase2b_softmax_expf, project_qwen36_drift_gdn_clean). Drift accumulates through the SSM stack and corrupts late full-attention layers L31-L39. Fix: route variant==Fp8Dequanted to build_linear_attention_fp8; remove parallel _nvfp4 build. Medium difficulty. 2. MED — Dead NVFP4 shared expert built but never consumed (ssm_qwen35.rs:184). Memory waste, no correctness impact. 3. MED — MoE router gate loaded via `dense` (raw bytes, no dtype conversion) then quantize_to_nvfp4'd. If gate is FP8 in checkpoint, the bytes are read as BF16. Fix: dense → dense_auto. 4. LOW — LM head same pattern. Currently dormant (Qwen3.6 FP8 leaves lm_head BF16) but latent for any future FP8 checkpoint that quantizes the head. 5. LOW — MTP head double-quantizes via quantize_to_nvfp4 for all projections under MtpQuantization::Nvfp4. Default should be Fp8 when native_fp8. Recommended single fix: Bug #1 (re-enable FP8-native SSM decode). The required kernels and plumbing exist; only the dispatch needs re-wiring. 
Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>

Tried to re-wire LinearAttention to use `build_linear_attention_fp8` for `Nvfp4Variant::Fp8Dequanted`. Build succeeded, container crashed at layer 0 load with `cuMemcpyDtoDAsync_v2 status 1` (INVALID_VALUE). Root cause (deeper than the CAUSAL-PATHWAY-AUDIT spotted): the `Fp8Weight` struct's `row_scale` field documents itself as "`[N]` f32 per-row dequant scale" (32 KB for N=8192). But `load_fp8_block_scaled_as_fp8weight` populates that same field with per-BLOCK BF16 scales of shape `[N/BS, K/BS]` (= 2 KB for N=8192, K=2048, BS=128 → 64*16*2 bytes). `build_linear_attention_fp8` then concatenates QKV+Z scales via: gpu.copy_d2d(qkv_fp8.row_scale, qkvz_scale_ptr, qkv_rows * 4) which asks CUDA to copy 32 KB from a 2 KB source. CUDA correctly returns INVALID_VALUE. The downstream `fp8_gemm_n128` kernel that decode_batched calls also doesn't take any scale argument — it expects single-scale FP8 (produced by `bf16_to_fp8` from a BF16 dense), not the block-scaled FP8 the checkpoint ships. So even if the concat were fixed, the math would be wrong (FP8 values un-rescaled by their per-block multipliers). Reverting the dispatch back to NVFP4-only for LinearAttention. Updated comments at both sites to document why the FP8 arm stays dead-coded and what kernel work would be required to revive it. The CAUSAL-PATHWAY-AUDIT.md Bug #1 finding stands but is *open* — the right fix is to rewrite the FP8 SSM GEMV/GEMM chain to consume the per-block scales the checkpoint provides, not to flip a dispatch. In the meantime, the production workaround is to serve the NVFP4 checkpoint (`RedHatAI/Qwen3.6-35B-A3B-NVFP4`). Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
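The size mismatch behind the INVALID_VALUE can be checked with plain arithmetic. A minimal sketch, with hypothetical helper names, using only the shapes quoted above (N=8192, K=2048, BS=128):

```python
def per_row_f32_scale_bytes(n_rows):
    """Size the Fp8Weight docs describe: one f32 scale per output row."""
    return n_rows * 4


def block_scale_bf16_bytes(n_rows, k_cols, block=128):
    """Size the block-scaled loader actually writes: [N/BS, K/BS] BF16 scales."""
    if n_rows % block or k_cols % block:
        raise ValueError("N and K must be multiples of the block size")
    return (n_rows // block) * (k_cols // block) * 2


expected = per_row_f32_scale_bytes(8192)       # bytes the concat tries to copy
actual = block_scale_bf16_bytes(8192, 2048)    # bytes the source buffer holds
print(expected, actual, expected // actual)
```

The copy requests 16 times more bytes than the source buffer holds, which is consistent with CUDA rejecting the call.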

…parser Multi-iteration work toward clean opencode multi-turn agentic sessions on Qwen3.6-35B-A3B-FP8. The combined image (atlas-gb10:fp8-much-better) ships: - Native FP8 SSM dispatch (block-scaled `w8a16_gemv` decode + single-scale `fp8_gemm_n128` prefill) — no NVFP4 detour. WeightQuantFormat enum tags Fp8Weight buffers with `Fp8BlockScaled`/`Fp8PerRow`/`Fp8SingleScale` so kernel-call sites can assert format and fail-fast on silent quant leaks. Concat math for QKVZ along the N-block axis writes `(qkv_rows/BS)*(K/BS)*2` bytes of BF16 scales per row, then `(z_rows/BS)*(K/BS)*2` — both Nq and Nz are required to align to BS=128 by the on-disk Qwen FP8 format. Out-proj is loaded block-scaled directly. - Streaming detokenizer fix: content phase now uses the full-decode + slice approach (matching the reasoning path). HF tokenizers' DecodeStream drops the leading metaspace byte at certain BPE-token boundaries for byte-level tokenizers like Qwen's, producing `name = test-rust-axum-v32version` in streamed `<parameter=content>` while non-streaming `tokenizer.decode(&all_toks)` yields the correct multi-line TOML. `tokenizer.decode(&state.all_toks)` + `full[state.emitted..stable_end]` is byte-exact vs non-streaming for both reasoning and content (verified live with temp=0 seed=42). - MoE gate stays BF16 for native_fp8: the `[num_experts, h]` BF16 gate is in the FP8 release's `ignored_layers`; runtime-quantizing it to NVFP4 (4-bit) destroys the precision the router needs at late layers where the top-8 weights cluster in `[0.105, 0.168]`. Non-FP8 variants keep the runtime NVFP4 quantize. - Whitespace-only stream chunks are no longer dropped — `cleaned.is_empty()` instead of `cleaned.trim().is_empty()`. The `sanitize_content_chunk` holdback rollout commits leading `\n ` indent as a chunk; previously the trim filter dropped that chunk and `state.emitted` advanced past the unemitted bytes, permanently losing indent. 
- PR 73's qwen3_xml parser + schema-driven type coercion cherry-picked (qwen3_xml.rs, type_coerce.rs, group_e.rs tests, registration in tool_parser.rs). Live opencode v40 showed qwen3_xml itself confused the model under thinking_in_tools=true + long opencode prompt, so MODEL.toml defaults back to qwen3_coder (auto-detect from model_type) — but qwen3_xml stays in the binary as a selectable parser for downstream consumers who want the type-coercion behavior. - post-think EOS guard scoped to `require_tool_call`. Was unconditional POST_THINK_MIN_CONTENT=16 — forced short answers like "4" or "Paris" to keep generating into chat-template artefacts (`\nuser\nassistant`). Now only suppresses EOS when tools are required, so plain-text answers let `<|im_end|>` fire as expected. - MODEL.toml `thinking_in_tools = true` (matches user's "MUCH better" recall); `tool_call_parser` default left to auto-detection so qwen3_coder fires under `model_type = qwen3_6_moe`. All WeightQuantFormat constructors tagged. cargo check / cargo clippy green workspace-wide. Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
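The whitespace-chunk fix in this commit is easy to see on a toy stream. A minimal sketch with hypothetical function names, using Python's `str.strip()` as a stand-in for Rust's `trim()`; the chunk contents are illustrative:

```python
def emit_with_trim_filter(chunks):
    """Old behaviour: a chunk is dropped when it is empty after trimming."""
    return "".join(c for c in chunks if c.strip() != "")


def emit_with_empty_filter(chunks):
    """New behaviour: only genuinely empty chunks are dropped."""
    return "".join(c for c in chunks if c != "")


# A holdback rollout that commits the newline + indent as its own chunk.
chunks = ["[dependencies]", "\n    ", 'axum = "0.8"']
print(repr(emit_with_trim_filter(chunks)))
print(repr(emit_with_empty_filter(chunks)))
```

With the trim filter the newline and indent vanish and the two lines are glued together, which matches the kind of token-glue artifact described for streamed `<parameter=content>`.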

+
+
+def fmt(x: float) -> str:
+    if x != x:  # NaN


+        for h in headers[1:]:
+            deltas = [data[h][i] - base[i] for i in range(40)]
+            mean_d = float(np.nanmean(deltas))
+            max_d = max(deltas, key=lambda x: abs(x) if x == x else 0)


@tbraun96

Applies the FP16 P×V MMA upgrade + __expf softmax replacement from Avarok-Cybersecurity#90 (fix/in-think-tool-call-leak) to the asym prefill kernel that this PR introduced, and pulls the same fix into the upstream symmetric kernels we carried forward unchanged. Motivation (per @tbraun96's PR Avarok-Cybersecurity#92 review + Discord context): the prior `sw_exp` polynomial advertised ~1e-4 max relative error, but verifies at ~5.6e-3 (~0.5%) against `torch.exp`. Across 18920-token attention rows × 10 full-attention layers, that compounds to measurable per-layer cosine drift vs HF reference. The FP16 P×V MMA upgrade trades ~10% prefill slowdown for ~8× higher mantissa precision on the softmax probabilities (P), which is the dominant remaining attention-output drift source on Qwen3.6-35B-A3B-FP8. Files: - kernels/gb10/common/prefill_paged_compute.cuh — direct cherry-pick of the Phase 2c kernel changes + ATLAS_DISABLE_FP16_PV bisect toggle (matches PR Avarok-Cybersecurity#90 byte-for-byte in the fix regions). - kernels/gb10/common/prefill_paged_compute_512.cuh — same sw_exp refactor for the HDIM=512 path. - kernels/gb10/common/prefill_paged_compute_asym.cuh — TQ+ asym fork carries the same precision bug; applied the equivalent fix (helper + __half smem_P/P64 + __float2half_rn + .f16.f16 MMA). Skipped the ATLAS_DISABLE_FP16_PV debug toggle for now (can add later if the team wants to bisect asym paths separately). Q×K stays BF16 (range matters there); P×V becomes FP16 (precision matters, range is bounded [0,1] post-softmax). All bf16 stores now use __float2bfloat16_rn for RNE rounding. Verified locally: nvcc 13.0 compiles all 114 kernels clean on sm_120 including the TQ+ asym variants (bf16k_turbo3v, fp8k_turbo3v, etc).
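The precision argument for FP16 P×V can be illustrated on a single softmax probability. This is a host-side sketch only, assuming round-to-nearest-even in both formats and positive finite inputs; the helper names are hypothetical and nothing here is kernel code.

```python
import struct


def round_fp16(x):
    """Round a Python float to IEEE half precision (10-bit mantissa)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_bf16(x):
    """Round a finite float to bfloat16 (7-bit mantissa), nearest-even."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def rel_err(approx, exact):
    return abs(approx - exact) / abs(exact)


p = 0.3  # a post-softmax probability, so it lies in [0, 1]
print(round_bf16(p), rel_err(round_bf16(p), p))
print(round_fp16(p), rel_err(round_fp16(p), p))
```

FP16 carries three more mantissa bits than BF16, so its rounding step is 8× finer for values in its normal range. That is why the trade only makes sense for P, whose range is bounded, and not for Q×K.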

`compile_qwen3_coder_tool_grammar` was forcing the tool-call body to `json_schema` content type, but the qwen3_coder wire format the model emits — and the parser at `parse_single_b.rs:60-132` expects — is native XML (`<parameter=KEY>VALUE</parameter>`), not JSON. When the FSM was in a JSON-string state and the model wanted to emit a single BPE token that decoded to bytes containing `<parameter=` (or any non-JSON-string-class character), xgrammar masked the entire token and forced a near-substitute. The result, live in opencode multi-turn agentic sessions on 2026-05-25: /tmp/test-rust-axum-v42 → /tmp/test-rust-axu-v4 (drops `m`, drops `2`) /tmp/test-rust-axum-v42 → /tmp/test-rust-axu-m-v4 (drops `2`, inserts `-`) /tmp/test-rust-axum-v42 → /tmp/test-rust-axum-v01 (4→0, 2→1) "filePath":"]}]}]}}}" (JSON-delimiter cascade) Switching body to `any_text` keeps the OUTER `<tool_call>\n<function=…>\n …\n</function>\n</tool_call>` framing constrained by xgrammar (begin/end fixed) while leaving the body bytes unconstrained, so native XML `<parameter=>` blocks pass through intact. Schema validation remains: `validate_single_tool_call` + `backfill_required_params` run host-side after `parse_one_call` (`tool_handlers.rs:46`), catching any actual schema violations after the fact. Mirrors MiniMax's grammar pattern at `compile_minimax_xml_tool_grammar` line 472 which has used `any_text` for the same native-XML reason. Live verification (atlas-gb10:fp8-grammar-any-text, opencode axum v43): - v41 (nvfp4-KV + hp5): 0 files persisted (path corrupted) - v42 (fp8-KV uniform): 0 files persisted (path corrupted) - v43 (this commit): 20 files persisted, byte-exact `/tmp/test-rust-axum-v43/Cargo.toml`, full cargo new skeleton on disk. Investigation thread that landed here: - 40-layer cosine study on dgx2 ruled out model precision drift (Atlas[FP8-native] vs HF[FP8→BF16] mean cos 0.99497, min 0.99012). - KV format swap (v41 nvfp4-hp5 vs v42 fp8-uniform) ruled out KV. 
- Six parallel Explore agents on dgx1 ruled out: * HF DecodeStream (audit found no live call sites; 8d2cc87 was clean). * Sampler penalties in tool body (`inside_tool_body` flips correctly on token 248058; rep_pen/DRY/freq/presence zeroed in body). * Stop-string holdback (char-boundary safe, flushed on done). * Mistral whitespace-skip char-boundary issue (latent, separate bug). * Sanitizer `state.emitted` cursor desync (real content-phase bug to fix separately; not the tool-arg path). - Three web-research agents pinpointed vLLM PR #35615 (Qwen3Coder spec-decode parameter loss) but reading Atlas's `streaming_impl.rs` showed Atlas already buffers until `</tool_call>` and parses the full inner via `parse_one_call`, so the close-before-drain pattern is structurally absent here. The remaining hypothesis — grammar bitmask over-constraining valid BPE tokens at the JSON-schema body — survived; this commit confirms it. Tests updated (`grammar/tests/qwen3_coder_required.rs`): the previous suite pinned "grammar REJECTS empty body" — a property of the prior `json_schema` body type. With `any_text`, required-param enforcement moves host-side to `validate_single_tool_call`. New tests pin the envelope shape and accept BOTH native XML and legacy JSON bodies (the parser supports both via its JSON fallback at parse_single_b.rs:137). Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Mirrors F78 (Write family non-empty path) for shell-execution tools. Without this, the qwen3_coder `any_text` body grammar (commit e99159d) accepts an immediately-closed parameter `<parameter=command></parameter>`, backfill_required_params then inserts `""`, validator passes, opencode returns "The argument 'file' cannot be empty. Received ''", and the model burns its turn budget retrying the same empty call. Live opencode v44 hit exactly this pattern: cargo init succeeded twice clean, then 2× empty bash, then the model "started fresh" on a wrong version number and burned to max_tokens. The previous `json_schema` body grammar paired with `enforce_min_length_on_required_strings` (`grammar/schema.rs`) enforced min_length 1 at the FSM level. Lifting that check to the validator post-parse keeps the same invariant while letting the grammar body be `any_text` (the native qwen3_coder XML wire format the parser at `tool_parser/parse_single_b.rs:60-132` actually expects). Verification on v45 (atlas-gb10:fp8-grammar-shellfix): - cargo init succeeded byte-exact `/tmp/test-rust-axum-v45`. - read Cargo.toml clean. - write Cargo.toml with `axum = { version = "= 0.8", features=["json"] }`. - One content-loop attractor (`version = { version =` period-2 repeat) correctly caught by the existing watchdog; model recovered next turn. - 0 empty-bash retries — the new validator stops them at the gate. Out of scope (deferred): - The Write tool's `content` parameter is still accepted as empty — F78 only carves out PATH_KEYS. Adding `content` would interact with legitimate "create empty file" intents, so deferred until live evidence shows it's a regression risk. - The MoE FP8 dequant drift identified by the dgx2 per-op cosine study (`bench/fp8_dgx2_drift/MASTER_DRIFT_TABLE.md`, worst op ssm.moe_out @ L20 = 0.91983) — multi-day kernel-level investigation. Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Empty-required-string validation rejections (F78 path empty, shellfix command empty) used to emit a `[atlas] Tool call rejected: …` content chunk and trip `stop_string_triggered`. But `handle_tool_call_start` had already emitted the `tool_calls[idx]` header to opencode, so suppressing the args delta left opencode mid-call with no completion. opencode then reported the cryptic `SchemaError(Missing key)` instead of its own per-tool error message ("The argument 'file' cannot be empty. Received ''"), starving the model of actionable feedback.

Soft errors (any validation message containing "non-empty") are recoverable: emit the args delta as the model produced them, log a warn, and let opencode's per-tool schema surface its own error. Hard errors (unknown tool name, args not valid JSON, malformed path that looks like a directory) still bail with a content chunk because they cannot be made into a complete tool call at all. Same treatment applied to `handle_complete_tool_call` for the non-streaming complete-call path.

Verification (atlas-gb10:fp8-grammar-softrej, opencode v47):

- cargo init succeeded byte-exact `/tmp/test-rust-axum-v47`.
- Subsequent empty-bash retries surfaced opencode's clean error "The argument 'file' cannot be empty. Received ''" rather than SchemaError(Missing key).
- Model still drifts into empty-parameter bodies under MoE FP8 dequant precision drift (see `bench/fp8_dgx2_drift/MASTER_DRIFT_TABLE.md`, ssm.moe_out @ L20 = 0.91983); recovery via the cleaner error message alone is insufficient. Sampler-level enforcement (mask `</parameter>` close when param body has emitted zero non-whitespace chars) is the next intervention.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
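The soft/hard split described in this commit amounts to a small dispatch. A minimal sketch, assuming the "contains non-empty" test stated above; the function names, the returned dict shape, and the example messages are hypothetical.

```python
def is_soft_rejection(message):
    """Soft errors are identified by the validator message text."""
    return "non-empty" in message


def handle_rejection(message, args_delta):
    """Decide what to send downstream after a validation failure."""
    if is_soft_rejection(message):
        # Recoverable: forward the args as produced so the client can
        # complete the call and report its own per-tool error.
        return {"action": "forward_args", "args": args_delta}
    # Unrecoverable: no complete tool call can be formed, so bail with text.
    return {
        "action": "content_chunk",
        "text": f"[atlas] Tool call rejected: {message}",
    }
```

Matching on message text is fragile if validator wording changes; a typed error kind would be a sturdier design, though the commit message only describes the string check.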

Atlas's MoE FP8 grouped GEMM was applying the BF16 block-scale to each FP8-dequant element BEFORE storing it in SMEM as BF16:

    float scale = __bfloat162float(S_exp[n_block * k_blocks + k_block]);
    smem_B[k][n] = __float2bfloat16(E4M3_LUT[byte] * scale);

The `__float2bfloat16(LUT * scale)` truncates each scaled weight to BF16's 7-bit mantissa BEFORE the BF16×BF16 MMA, so the scale's precision interacts with EVERY single multiply-accumulate. Per the dgx2 per-op drift study at `bench/fp8_dgx2_drift/MASTER_DRIFT_TABLE.md`, this manifested as the worst single Atlas op cosine vs HF[BF16-unquant]: ssm.moe_out @ L20 = 0.91983 (8% deviation; unacceptable).

The fix is DeepGEMM's two-level accumulation:

1. SMEM stores `__float2bfloat16(E4M3_LUT[byte])` only, which is lossless because FP8 E4M3 has a 3-bit mantissa and BF16 has a 7-bit mantissa.
2. Inner FP32 accumulator collects BF16×BF16 MMA products for one K=128 scale-block (8 K_STEP=16 iterations).
3. At the K-block boundary, multiply the inner accumulator by the block-scale (one BF16→FP32 conversion per K-block per CTA) and add to the outer FP32 accumulator.
4. Reset inner accumulator; advance to next K-block.
5. Final BF16 cast in the epilogue from outer_acc.

Net effect: the scale's BF16 truncation now happens ONCE per K=128 partial sum, instead of being baked into every dequanted weight. The FP8 weight precision is preserved through the inner MMA reduction.

Per-CTA observation that simplifies the patch: N_TILE=64 < FP8_BLOCK=128 and `cta_n` is always a multiple of N_TILE, so all 64 N-cols of a CTA fall within a single N-scale-block. One scale lookup per K-block per CTA, applied uniformly to all 8 n_tile sub-tiles. No per-tile dispatch needed.

Applied to both kernel variants (`moe_fp8_grouped_gemm` v1 with the original thread-to-smem mapping, and `moe_fp8_grouped_gemm_v2` with the coalesced-load thread mapping).
**Measured cosine improvement (Atlas vs HF[BF16-unquant], 10382-token canonical prompt, dgx1 single-node)**:

| Metric | BEFORE | AFTER | Δ |
|---|---|---|---|
| L20 ssm.moe_out (worst) | 0.91983 | 0.93547 | +0.0156 |
| ssm.moe_out mean (n=30) | 0.97456 | 0.98133 | +0.0068 |
| attn.moe_out mean (n=10) | 0.97316 | 0.97806 | +0.0049 |
| ssm.out_proj mean (n=30) | 0.97813 | 0.98144 | +0.0033 |
| layer.hidden_out mean | 0.98982 | 0.99219 | +0.0024 |

The 1.56pp lift on the worst single op proves the BF16-scale-truncation was a real precision sink. The remaining ~0.065 gap to perfect cosine at L20 (1 − 0.93547) will be addressed by stacking complementary patches:

- Native FP8 MMA on SM121 (per agent A6, confirmed available: `mma.sync.m16n8k32.f32.e4m3.e4m3.f32` lowers to `QMMA.16832.F32.E4M3.E4M3`): eliminates the BF16 input-quantization step entirely AND yields ~2× tensor-core throughput. ~100-200 LoC kernel rewrite.
- FP32 residual stream (per agent A8, A1): replaces `residual_add_rms_norm` BF16 output with the existing FP32 variant. Reduces input-side drift compounding across 40 layers.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
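The difference between per-element scaling and two-level accumulation can be reproduced on the host. This is a toy model under stated assumptions: Python doubles stand in for the FP32 accumulators, one K-block is a short list, `round_bf16` uses round-to-nearest-even, and all names and sample values are hypothetical. It is not the CUDA kernel.

```python
import struct


def round_bf16(x):
    """Round a finite float to bfloat16 (7-bit mantissa), nearest-even."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def dot_scale_per_element(weights, acts, scale):
    """Old path: scale each weight, round to BF16, then multiply-accumulate."""
    return sum(round_bf16(w * scale) * a for w, a in zip(weights, acts))


def dot_two_level(weights, acts, scale):
    """New path: accumulate unscaled BF16 weights, scale once per block."""
    inner = sum(round_bf16(w) * a for w, a in zip(weights, acts))
    return inner * scale


# E4M3-representable weights, BF16-representable activations, BF16 scale.
WEIGHTS = [1.125, -0.875, 1.75, 0.4375]
ACTS = [0.5, 1.0, -0.25, 2.0]
SCALE = round_bf16(0.0123)
REFERENCE = sum(w * SCALE * a for w, a in zip(WEIGHTS, ACTS))

print(abs(dot_scale_per_element(WEIGHTS, ACTS, SCALE) - REFERENCE))
print(abs(dot_two_level(WEIGHTS, ACTS, SCALE) - REFERENCE))
```

In this toy the two-level result matches the reference to double precision, because the unscaled E4M3 values are exactly representable in BF16 and the scale is applied to the finished partial sum instead of to each weight.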

12-hour autonomous mission Epochs 1-3. The 11-agent research synthesis identified three layered interventions to enforce non-empty tool-call parameter values against the Qwen3.6-FP8 multi-turn drift class: A (grammar minLength), B (sampler byte counter), C (parser holdback). This commit ships A + B.

The 4 grammar attempts converged on raw EBNF as the only architecturally-correct primitive:

- `[\s\S]*\S[\s\S]*` (regex sandwich): ε-transitive `\S` in Kleene closures, FSM skips required content (A3 finding).
- `[^ \t\r\n<][^<]*` (regex `+` with structural prefix): inner enforcement works but the outer `+` quantifier is still ε-transitive, letting `</tool_call>` close with zero `<parameter=>` blocks (B6 finding).
- `json_schema` with `style: "qwen_xml"` and `minLength: 1` (B6): json-schema converter has the same ε-edge bug for `[^]{1,}` lowering.
- **EBNF with `value ::= first_char rest` (B5)**: rule INLINING (per llama.cpp's GBNF compiler) means the parent rule is expanded to require `first_char` as a single non-quantified terminal, so no ε-transition can skip it. This is what ships.

EBNF body for qwen3_coder XML wire format:

    root       ::= param ("\n" param)*
    param      ::= "<parameter=" paramname ">" value "</parameter>"
    paramname  ::= [a-zA-Z_] [a-zA-Z_0-9]*
    value      ::= first_char rest
    first_char ::= [^ \t\r\n<]    (forces ≥1 non-WS non-< byte)
    rest       ::= [^<]*

Belt-and-suspenders Tier-1 sampler-side byte counter (decode_logits_seq.rs): new ActiveSeq fields `inside_parameter_body: bool` and `param_body_chars_emitted: u32` track per-sequence state. When inside a parameter body AND no content tokens have been emitted yet, logit_bias appends `(510, -8.0)` (`</` close-tag opener) PLUS `(220, 198, 197, 256, 271, -8.0)` (common Qwen3 whitespace tokens). The emit_step.rs flag flip detects the opener via the last-8-token signature `[27, 15704, 28]` ending in `>` (29); decrements on `</` (510); does NOT count whitespace tokens toward the chars counter (so the mask survives a sampled whitespace token).
Critical Tier-1 gate: forced_token_fastpath at decode_logits_seq.rs:307 was bypassing the logit_bias by returning the grammar's sole legal token directly. When `tier1_active` (inside_parameter_body && chars==0) the fastpath is skipped so the sampler pipeline actually applies bias. Also enabled `wants_typed_arguments=true` on Qwen3CoderParser so PR-73's `coerce_all` runs schema-aware type coercion (string → number) on bash's `timeout` field — addresses opencode's `SchemaError(Expected number, got "30")` failure mode independent of the param-body work. **v57 measured result** (Epoch 3 EBNF + Tier-1 sampler): - 1 file persisted (Cargo.toml) — best non-precision-stack result - axum 0.8 dependency cleanly added with json feature - Atlas log shows model emitted real axum code: `use axum::{routing::get, Json, Router}; async fn pong() -> Json` - Remaining gap: model sometimes emits 1-char garbage values (single backslash) to satisfy the 1-char minimum; need schema-aware min-length OR Tier-2 closer-suffix holdback. Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
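The language that the EBNF body accepts can be checked offline with an ordinary regex. This sketch uses Python's `re`, which has none of the ε-edge issues reported for the vendored xgrammar, so it only illustrates which bodies the grammar admits; it is not a substitute for the grammar itself, and the function name is hypothetical.

```python
import re

# One <parameter=NAME>VALUE</parameter> block; VALUE must start with a
# non-whitespace, non-'<' character and may not contain '<'.
_PARAM = r"<parameter=[a-zA-Z_][a-zA-Z_0-9]*>[^ \t\r\n<][^<]*</parameter>"
# root ::= param ("\n" param)*
_BODY = re.compile(_PARAM + r"(?:\n" + _PARAM + r")*")


def body_is_valid(body):
    """True when the whole body matches the EBNF above."""
    return _BODY.fullmatch(body) is not None


print(body_is_valid("<parameter=command>cargo init</parameter>"))
print(body_is_valid("<parameter=command></parameter>"))
print(body_is_valid("<parameter=filePath>\\</parameter>"))
```

The last call shows the gap noted in the v57 result: a single backslash satisfies the one-character minimum, so structural validity alone does not rule out garbage values.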

12-hour autonomous mission Epoch 4 (final): tightened validate_single_tool_call so WRITE_FAMILY paths must start with '/', './', or '../' AND be ≥3 chars, and SHELL_FAMILY commands must be ≥2 chars after trim. Addresses the model's "satisfy minimum with single garbage char" failure mode that Epoch 3's EBNF rule-inlining enforcement exposed (model emitted single backslash `\` for filePath to satisfy the EBNF ≥1 non-WS non-`<` requirement).

Result: v58 test still scored 0 files because the model now drifts to entirely-wrong paths like `/test-tu-au-u8` that PASS the loose "starts with /" validator but bear no resemblance to the requested target. Confirms: no Atlas-side enforcement can teach the model TASK SEMANTICS, only structural validity. Multi-turn task faithfulness gap remains the open problem.

Mission summary written to bench/fp8_dgx2_drift/MISSION_PROGRESS.md. Best result of the mission: Epoch 3 (raw EBNF) v57, with 1 file persisted (Cargo.toml with valid axum 0.8 dependency); Atlas logs captured real axum handler code emitted by the model (`use axum::{routing::get, Json, Router}; async fn pong() -> Json`). EBNF rule-inlining is the only architecturally-correct grammar primitive for non-empty enforcement on Atlas's vendored xgrammar (all 3 prior regex/json_schema attempts blocked by upstream ε-edge bugs).

Recommended next-mission focus: multi-turn task faithfulness (FlowKV per-turn KV isolation arXiv:2505.15347, preserve_thinking chat-template fix per A4 community research, native FP8 MMA to close remaining 0.04 cosine gap on MoE block).

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Checkpoint of the Qwen3.6-35B-A3B-FP8 vs vLLM cosine-parity investigation. Key fixes: - paged_oproj.rs: W8A8 o_proj called fp8_gemm_t_blockscaled with N and K swapped (passed nq*hd as N, h as K). Fix lifted Atlas-vLLM mean residual cosine 0.800 -> 0.997 on short prompts. - quant_helpers.rs: dequant_fp8_blockscaled_to_bf16 now launches a GPU kernel (dequant_fp8_blockscaled_bf16.cu) instead of a D2H + CPU loop + H2D round-trip. FP8 MoE BF16-dequant load drops ~22min -> ~49s (~130x on the dequant step), bit-identical numerics. Diagnostics (env-gated, inert by default): - op_dump hooks for attn_out_pre_gate/post_gate, k_post_norm, k_post_rope, q_post_rope in qwen3_attention prefill paths. - BF16 MoE grouped GEMM + dispatch (ATLAS_FP8_DEQUANT_MOE_TO_BF16), plus bench/fp8_dgx2_drift cosine + op-cosine harness. Findings: MoE GEMM precision is NOT the drift source (BF16 MoE == FP8 MoE at L39 cos 0.952 vs 0.955; NVFP4 MoE worse at 0.868). Remaining 0.955 floor traced to FP8 attention Q/K/V/O projections. Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…iagnostics Root-caused the opencode agentic-harness "wandering" (shell-fragment dir names, collapsed-newline TOML) via a per-decode WS-mask diagnostic — it was NOT numerical drift but two server-side logit/validation bugs vLLM lacks: - validation.rs: the write-path validator required filePath to start with '/', './' or '../', rejecting bare relative paths like `Cargo.toml` that opencode resolves against --dir (and that vLLM accepts). opencode looped on the rejection and abandoned the task. Now accepts any non-empty path without shell metacharacters/whitespace (still rejects leaked commands like `created && ls -R` — CWE-78). - whitespace_mask.rs: the WS1/WS2 mask suppressed the newline token (198) inside tool-param bodies. ATLAS_WS_MASK_DIAG confirmed the model's pre-mask top-1 was `\n`, demoted to `;`/` &&`/` ||` — the exact mechanism producing shell-fragment paths AND collapsed-newline TOML. Now masks only HORIZONTAL whitespace; line breaks are never suppressed. Added ATLAS_WS_MASK_OFF=1 to disable the whole mask (matches vLLM, which applies no whitespace bias at all). Diagnostics / kill-switches (env-gated, inert by default): - ATLAS_WS_MASK_DIAG=1: logs param-body whitespace argmax flips. - ATLAS_DISABLE_FUZZY_REPAIR=1: no-op the SymSpell tool-arg repair. - ATLAS_FP8_DEQUANT_ATTN_TO_BF16=1: dequant attention Q/K/V/O to BF16 (diagnostic, TP=1; confirmed weight precision is not the drift source). - run_tier.sh --bail: exit on first non-clean run. Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
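The relaxed write-path rule (any non-empty path without shell metacharacters or whitespace) can be sketched as below. The exact character set that validation.rs rejects is not stated in the commit message, so `SHELL_METACHARS` here is an assumption, as is the function name.

```python
# Assumed metacharacter set; the real validator's list may differ.
SHELL_METACHARS = set("&|;<>$`\\\"'(){}*?!#")


def write_path_is_acceptable(path):
    """Accept bare relative paths; reject empties and leaked commands."""
    if not path:
        return False
    if any(ch.isspace() for ch in path):
        return False
    return not any(ch in SHELL_METACHARS for ch in path)


for candidate in ["Cargo.toml", "./src/main.rs", "created && ls -R", ""]:
    print(repr(candidate), write_path_is_acceptable(candidate))
```

Under this rule `Cargo.toml` passes without a leading `/`, `./` or `../`, while a leaked command such as `created && ls -R` is still refused on both the whitespace and the metacharacter checks.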

Generalizes the WS_MASK_DIAG probe into a full per-decode-step JSONL dump for Atlas↔vLLM divergence analysis. Each step records raw model top-K (id,logit) BEFORE Atlas's additive logit-bias stack, the itemized list of every (id,delta) bias applied (WS mask, attractor, A4 think-suppress, C4 lift — none of which vLLM has), the post-bias argmax, and the sampled token, sliced by tool-param-body context. Diffing raw_topk against vLLM's localizes MODEL divergence; the bias list itemizes Atlas-only processor divergence. Env-gated, inert by default; bias is cloned only when active. Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

On an exact full-prompt prefix-cache hit (matched == total), the engine re-embeds and re-runs the last prompt token through all layers to produce the first generated token's logits. For SSM/Mamba layers that re-run applies the last token's recurrent update a SECOND time on top of the restored snapshot state@N, double-advancing the recurrent state and corrupting both the first-token logits and the state decode reads. Proven by byte-compare: fresh-prefill SSM state == warm-restored SSM state exactly (all layers) — save/restore was never the bug; the re-run was. Intermediate hits (matched < total) recompute the tail from a checkpoint and process each token once, so they were already correct. Fix: the leaf snapshot now also stashes the last token's post-final-norm hidden (hidden_size BF16). On an exact hit, finalize_last re-restores the pristine SSM state@N (undoing the redundant re-run's advance) and feeds the stashed hidden straight to lm_head, skipping any SSM re-run. The redundant 1-token forward is otherwise harmless (its KV write duplicates cached values). Cold/warm output is now byte-identical and warm hits stay fast. Note: prefill_c.rs prefill_full_cache_hit (two-phase / EP=2) shares the same pattern and is a known follow-up; not on the single-GPU prefill_b path. Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
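The double-advance can be reproduced with a toy scalar recurrence. The sketch below is only an analogy: `ssm_step` is an invented linear update standing in for a real SSM/Mamba layer, and none of the names come from the Atlas code.

```python
def ssm_step(state: float, token: float) -> float:
    """Toy recurrent update: the new state depends on the old state and the token."""
    return 0.5 * state + token


def prefill(tokens):
    """Process every prompt token exactly once and return the final state."""
    state = 0.0
    for tok in tokens:
        state = ssm_step(state, tok)
    return state


tokens = [1.0, 2.0, 3.0]
fresh_state = prefill(tokens)   # state@N from a cold prefill
snapshot = fresh_state          # what the leaf snapshot stores

# Buggy exact-hit path: restore state@N, then re-run the last prompt token,
# which applies that token's recurrent update a second time.
buggy_state = ssm_step(snapshot, tokens[-1])

# Fixed exact-hit path: re-restore the pristine snapshot and skip the re-run
# (the real fix also feeds a stashed hidden vector to lm_head).
fixed_state = snapshot
```

Here `buggy_state` differs from `fresh_state` while `fixed_state` matches it, mirroring the byte-compare result that save/restore was sound and only the re-run corrupted the state.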

…ool-call drift)

Two opencode-quality fixes for Qwen3.6-35B-A3B-FP8, plus diagnostics.

MTP 0% draft acceptance — ROOT CAUSE + FIX

With ATLAS_FP32_RESIDUAL the main model's hidden stream is FP32, but the MTP head's step-2 input rms_norm read the saved hidden as BF16 → NaN → all-NaN logits → argmax==0 → every draft was token 0 → verify always rejected → 0% acceptance (no speculative speedup). The token embedding (always BF16) normed fine; only the FP32 hidden was misread, which is why prior weight-quant / vocab / KV-dtype attempts all failed (downstream of the NaN).

Fix: the MTP hidden-norm now uses the FP32-input rms_norm (rms_norm_f32, BF16 out) when use_fp32_residual, so fp32-residual and MTP coexist. With the fix MTP accepts ~1.67 drafts/step (~83%) and decode rises from ~34 to ~54 tok/s.

Also: mtp_head now honors BF16 KV (gated by mtp_quantization) instead of hard-coding FP8 with unit scales, matching the main model.

BF16 router (ATLAS_BF16_ROUTER) — tool-call drift

The MoE router/gate was NVFP4 (4-bit); FP8 routing flips drove malformed tool calls (empty filePath, mixed <function_calls> tags). Keeping ONLY the router in BF16 (a tiny num_experts×h GEMM; experts stay FP8 = full speed) eliminates the empty_path drift at no throughput cost.

Diagnostics (env-gated, inert by default): ATLAS_TOOLCALL_DEBUG (post-parse arg shape on empty-path), ATLAS_MTP_DEBUG_NORMS (staged MTP ‖·‖ to localize NaN/zero), ATLAS_LMHEAD_BF16 / ATLAS_OFFICIAL_TOOL_PROMPT (A/B levers).

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
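Why misreading an FP32 buffer as BF16 can produce NaN is easy to show at the bit level. The Python sketch below is an illustration under stated assumptions, not the kernel code: the helper names are invented, and the input value is hand-picked so that its low 16 mantissa bits happen to form a BF16 NaN pattern.

```python
import math
import struct


def bf16_bits_to_float(bits16: int) -> float:
    """Decode a BF16 bit pattern by widening it to FP32 (BF16 is FP32's top half)."""
    return struct.unpack("<f", struct.pack("<I", (bits16 & 0xFFFF) << 16))[0]


def misread_fp32_as_bf16(fp32_bits: int):
    """Reinterpret the 4 bytes of one FP32 value as two little-endian BF16 values."""
    low, high = struct.unpack("<HH", struct.pack("<I", fp32_bits))
    return bf16_bits_to_float(low), bf16_bits_to_float(high)


# A perfectly ordinary FP32 value slightly above 1.0, chosen for illustration.
fp32_bits = 0x3F807FC0
as_fp32 = struct.unpack("<f", struct.pack("<I", fp32_bits))[0]

# The low half is pure mantissa noise; here it decodes as a BF16 NaN.
low_half, high_half = misread_fp32_as_bf16(fp32_bits)
```

The original value is finite, yet its low half decodes to NaN and its high half to 1.0. Because the low mantissa bits of real activations are effectively arbitrary, a wide hidden vector read this way is very likely to contain at least one NaN, and a single NaN is enough to poison an rms_norm reduction.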

…RECOVERY) FP8 low-margin token flips corrupt the write tool-call filePath (empty, truncated dir, or hallucinated like /tmp/pure_axioms.txt) while the CONTENT is correct — so the Axum server never lands in src/main.rs, cargo builds the default scaffold (false-positive cargo_valid) and port 3001 never binds. recover_drifted_write_path() salvages the model's correct content to the content-inferred in-project path (Cargo.toml / src/main.rs) when the path is unusable. Env-gated (PCND opt-in); recovers intent, never invents content. Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Removes ATLAS_FP32_RESIDUAL + use_fp32_residual() and collapses all ~54 call sites to the BF16 branch (residual/hidden buffers, rms_norm / residual-add kernel selection, per-token byte strides). BF16 residual is the shipping config; the FP32 path was incompatible with MTP (it fed the MTP head FP32 hidden it read as BF16 → NaN → 0% draft acceptance). With FP32-residual gone the model is always BF16-residual, so MTP works by default. Orphaned FP32 .cu kernels left in place (unreferenced); the ATLAS_GEMMA4_FP32_LMHEAD scaffolding is now inert (it required FP32 residual) and can be excised in a follow-up. Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Three additive, content-recovering fixes for FP8 tool-call drift on the opencode rust-axum harness (Qwen3.6-35B-A3B-FP8), all gated behind the existing ATLAS_WRITE_PATH_RECOVERY flag (PCND opt-in, default-off):

1. Narrate-then-tool fence salvage (tool_salvage): when the model emits no native tool_call but renders a whole file in a bare ```rust/```toml fence, extract_fenced now infers the write target from the body's content shape (fn main(->src/main.rs, [package]->Cargo.toml) and synthesises the write(). Plumbs an infer_paths flag from salvage().
2. Write-content markdown fence-unwrap (validation): when the model dumps its entire markdown answer into a write() content field (code + stray ``` fence + trailing prose / "Principles Applied"), strip the wrapper and trailing prose for code files so the .rs/.toml compiles.
3. Path quoted-comma sanitizer (validation): strip a trailing comma/ws so a JSON-fragment-shaped filePath like `"/tmp/x/Cargo.toml",` is cleaned instead of creating a garbage-named file.

SSOT: extracts classify_path_from_content shared by recover_drifted_write_path and the fence salvage. Never invents content — only recovers the body/path the model produced. +8 unit tests (fence contamination, path-comma, salvage inference); no production behaviour change when the flag is off.

Together these produced Atlas's first end-to-end webserver_ok=true on the opencode harness. Residual webserver_ok gap is FP8 decode content-completeness (not serialization) — tracked separately under the BF16-MoE quality work.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
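Items 1 and 3 can be pictured with two small helpers. This Python sketch borrows the name classify_path_from_content from the commit message, but the bodies are guesses at the simplest behaviour consistent with the description; in particular, stripping the surrounding double quotes in `sanitize_file_path` (a hypothetical name) is an assumption.

```python
from typing import Optional


def classify_path_from_content(body: str) -> Optional[str]:
    """Infer an in-project write target from the shape of the file body."""
    if "[package]" in body:
        return "Cargo.toml"
    if "fn main(" in body:
        return "src/main.rs"
    return None  # unknown shape: recover nothing rather than guess


def sanitize_file_path(raw: str) -> str:
    """Clean a JSON-fragment-shaped path such as '"/tmp/x/Cargo.toml",'."""
    cleaned = raw.strip().rstrip(",").strip()
    # Assumed extra step: drop one pair of surrounding double quotes.
    if len(cleaned) >= 2 and cleaned[0] == '"' and cleaned[-1] == '"':
        cleaned = cleaned[1:-1]
    return cleaned
```

Returning `None` for an unrecognised body keeps the sketch in line with the stated policy of recovering intent without inventing content.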

… BF16 forward_k2/forward_k3 (the fused multi-token MoE forwards used for MTP verify and 2/3-sequence batched decode) only had FP8 and NVFP4 branches. When experts are dequanted FP8->BF16 at load (ATLAS_FP8_DEQUANT_MOE_TO_BF16), the FP8 source weights are freed, so these forwards read freed memory -> garbage verify logits -> degenerate repetition under MTP. Fix: early-return delegate to forward_batched(input, K, ...) when bf16_gate_weight_ptrs.is_some(). forward_batched already has the tested per-token BF16 expert path and writes the same moe_output()[K,H]. SSOT: reuses the decode BF16 kernels, no new fused batch2/batch3 kernel. Verified: MoE-dequant + MTP is now coherent (was repetition), stable across a full N=10 (no slot leak, no crash). Inert for the FP8/NVFP4 paths (bf16_gate_weight_ptrs is None without the dequant flag). Note: the harness-level BF16-MoE quality lift does not beat the FP8 baseline (decode-speed cost + partial quality); see memory project_bf16_dequant_lever. This commit lands the correctness fix regardless. Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Diagnostic probe tests (eprintln, no assertions) exploring TOML auto-repair behaviour on collapsed-one-line manifests with XML content leaks (`</content>`) and missing-`=` shapes observed in earlier harness runs. Scratch from a prior debugging session, preserved on request. Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>



…ostics

Two env-gated, default-off diagnostics for pinpointing async CUDA faults:

- ATLAS_DEBUG_SYNC_KERNELS (kernel_args.rs): synchronize the stream after every KernelLaunch::launch so an illegal-address fault surfaces AT the culprit kernel (with grid/block + a force-captured backtrace) instead of at a later, unrelated sync. Env read once via OnceLock — zero prod cost.
- ATLAS_DEBUG_NO_GRAPH (impl_a1.rs): force eager batched decode (suppress CUDA-graph capture) so the per-launch sync above is permitted (you cannot cuStreamSynchronize a capturing stream → 900).

Together these pinpointed the BF16 o_proj OOB (next commit).

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…_BF16 ms_phase_o_proj (multi-seq batched-decode O projection) had FP8 and NVFP4 branches but NO BF16 branch. Under ATLAS_FP8_DEQUANT_ATTN_TO_BF16 the o_proj weight is dequanted to BF16 (o_dense_bf16), so o_weight.as_fp8() returns None and the dispatch fell through to the NVFP4 w4a16_gemv_batch{2,3} path using the stale FP8/NVFP4 self.attn.o_proj — reading mismatched weight bytes → CUDA_ERROR_ILLEGAL_ADDRESS (700) in batched decode (n>=2 / MTP). Fix: add the o_dense_bf16 branch first (per-token dense_gemv), mirroring the single-seq decode path (attention_forward_oproj.rs). Inert for FP8/NVFP4 (o_dense_bf16 is None without the dequant flag). Verified: full bit-perfect config (MoE+attn BF16 + MTP) now runs coherent batched decode with zero CUDA-700 (was: crash). Enabling fix for the BF16 webserver_ok lever (#15). Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Layers a freshly-built spark binary onto the prior alpha image (FROM a local atlas-gb10 tag) to skip the multi-minute CUDA build during iteration — same pattern as Dockerfile.fence/.fast. Used to build the BF16 dequant / debug-sync diagnostic images this session. Base tag is local-only; not reproducible on a fresh clone (matches existing fast-layer Dockerfiles). Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

New per-layer gate for the FP8→BF16 dequant path: ATLAS_FP8_DEQUANT_LAYERS takes a comma-separated list of singletons/inclusive ranges (e.g. "31-39" or "31,35,39") of absolute layer indices. Unset → all layers (legacy behaviour, no change). Gates both ATLAS_FP8_DEQUANT_MOE_TO_BF16 and _ATTN_ via a single layer_dequant_selected() helper (parsed once via OnceLock).

Motivation: full-BF16 dequant lands the best per-token quality but decode is ~30% slower (2× MoE weight bandwidth → 360s harness timeouts; task #231). Restricting BF16 to the worst-drift late layers (L31-39) keeps the bulk FP8-fast: measured 44.7 tok/s (vs full-BF16 33.1, FP8 47.7) — ~94% of FP8 speed — while reaching cargo_valid 10/10 on the opencode harness (best of any tier; FP8=8/10, full-BF16=9/10, =vLLM 10/10).

webserver_ok stays ~0-1/10 across all quant configs: confirmed NOT a quant-quality or decode-speed problem — it's the agentic content-completeness ceiling (model reliably emitting a complete binding server), per the gap forensic.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
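The layer-list syntax is simple enough to sketch. The Python below is an illustration only: `parse_layer_spec` is an invented helper, and `functools.lru_cache` stands in for the parse-once behaviour that the commit attributes to OnceLock.

```python
import os
from functools import lru_cache


def parse_layer_spec(spec: str) -> frozenset:
    """Parse "31-39" or "31,35,39" into a set of absolute layer indices."""
    layers = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            layers.update(range(int(lo), int(hi) + 1))  # ranges are inclusive
        else:
            layers.add(int(part))
    return frozenset(layers)


@lru_cache(maxsize=1)
def _selected_layers():
    """Read and parse the env var once; None means 'unset, select every layer'."""
    spec = os.environ.get("ATLAS_FP8_DEQUANT_LAYERS")
    return None if spec is None else parse_layer_spec(spec)


def layer_dequant_selected(layer: int) -> bool:
    """True when this layer should take the FP8->BF16 dequant path."""
    selected = _selected_layers()
    return True if selected is None else layer in selected
```

With `ATLAS_FP8_DEQUANT_LAYERS=31-39`, only layers 31 through 39 would be selected; with the variable unset, every layer is. How the real helper treats an empty string is not stated, so this sketch arbitrarily treats it as "no layers".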

FP8 agentic failure mode (forensic gap #9): the model explores (bash ls/cat/find, read, glob) or narrates across many turns but never writes the deliverable, so the run ends with a valid Cargo.toml but no real src/main.rs — webserver_ok never fires.

BW1 tallies tool-call productivity across the conversation (write/edit tools + bash write/build/run verbs = productive; ls/cat/find/grep/read/glob = exploration) and, when the agent has made many calls with zero productive output, appends a steering nudge to the most recent tool response redirecting it to write the file(s) and verify. Escalates to a CRITICAL nudge past ~9 calls. Lives in hint_injector (alongside the error-recovery hints); injected at msg_entry after the conversation scan.

Env-gated ATLAS_BASH_WANDER_WATCHDOG=1 (PCND, default-off). +2 unit tests (classifier + threshold/escalation). Targets the only remaining webserver_ok lever after the quant/speed track (cargo_valid 10/10 via selective L31-39); forensic est +0..+2, partly model-inherent.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
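The tally-and-escalate logic might look something like the Python sketch below. Everything concrete in it is an assumption: the set of productive bash verbs, the first-nudge threshold of 5 calls, and the function names are all invented for illustration; only the escalation point of roughly 9 calls comes from the commit message.

```python
from typing import List, Optional, Tuple

PRODUCTIVE_TOOLS = {"write", "edit"}
# Hypothetical "write/build/run" verbs; the real classifier's list is not shown.
PRODUCTIVE_BASH_VERBS = {"cargo", "mkdir", "touch", "tee"}
NUDGE_AFTER = 5      # assumed first-nudge threshold
CRITICAL_AFTER = 9   # "past ~9 calls" per the commit message


def is_productive(tool: str, command: str = "") -> bool:
    """Classify one tool call as productive (True) or exploration (False)."""
    if tool in PRODUCTIVE_TOOLS:
        return True
    if tool == "bash":
        words = command.split()
        return bool(words) and words[0] in PRODUCTIVE_BASH_VERBS
    return False  # read, glob, grep, and everything else count as exploration


def nudge_level(calls: List[Tuple[str, str]]) -> Optional[str]:
    """Return None, "nudge", or "critical" for a conversation's tool calls."""
    if any(is_productive(tool, cmd) for tool, cmd in calls):
        return None  # the agent has produced something; leave it alone
    if len(calls) > CRITICAL_AFTER:
        return "critical"
    if len(calls) >= NUDGE_AFTER:
        return "nudge"
    return None
```

A conversation of six `bash ls -R` calls yields `"nudge"`, ten yields `"critical"`, and any single `write` call silences the watchdog.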

…sweeps) N=10 run JSONs + aggregate reports for the ws_ok lever investigation: fencecontent/pathrec baselines, bf16moe/bf16both/bf16sel selective-BF16 levers, bw1_sel anti-wander watchdog (confirmed no-op), vLLM 13/13 reference. Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ments in Cargo.toml

extract.rs: a bare ``` at end-of-content with no trailing newline made body_start = content.len()+1, panicking the slice at extract.rs:121 and crashing the salvage worker (observed live in disable-thinking opencode runs) — so that request's recovered write never landed. Guard body_start > len.

toml_repair.rs: the FP8 model sometimes annotates Cargo.toml with C-style `//` comments (TOML uses #), e.g. `name = "x" // not real TOML`, making the file unparseable. Add a conservative char-safe stripper (only outside double-quoted strings, so URLs/# are untouched) as a repair candidate.

Both unit-tested (toml_repair 11/11; extract EOF-fence regression). Found during the 12h /loop while investigating the Atlas-vs-vLLM webserver_ok gap.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
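A quote-aware `//` stripper of the kind described can be sketched in a few lines. This Python version is an illustration, not the toml_repair.rs code: the function name is hypothetical, and it handles only double-quoted strings with backslash escapes, as the commit message describes.

```python
def strip_c_style_comment(line: str) -> str:
    """Remove a trailing `// ...` comment unless the slashes sit inside a "string"."""
    in_string = False
    escaped = False
    for i, ch in enumerate(line):
        if in_string:
            if escaped:
                escaped = False          # this char was escaped; ignore it
            elif ch == "\\":
                escaped = True           # next char is escaped
            elif ch == '"':
                in_string = False        # closing quote
        elif ch == '"':
            in_string = True             # opening quote
        elif ch == "/" and line[i + 1:i + 2] == "/":
            return line[:i].rstrip()     # comment starts outside any string
    return line
```

For example, `name = "x" // not real TOML` becomes `name = "x"`, while `repository = "https://github.com/a/b"` is returned unchanged because its `//` is inside a string.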

…lling the agentic verify loop The spinning detector counted any assistant turn with content<500 AND tool_args<100 as 'short'; 5 consecutive → suppress_tool_call (-12 bias on <tool_call> start). But in an agentic coding loop the verify cycle (bash cargo build / cargo run / curl, read, small edit) is a run of legitimately short-arg tool calls — so it tripped after ~5 turns and hard-masked the next tool call, killing the build→error→fix→rebuild loop. Measured: Atlas capped at ~4-5 turns / 2 bash vs vLLM's 12-17 turns / 8 bash on the same task (vLLM has no such suppressor → 10/10 webserver_ok). Fix: a turn that issued ANY non-empty tool call is taking an action (progress), so it is substantial — spinning now only fires on consecutive short PURE-TEXT turns. Genuine repeated-tool-call loops remain caught by loop_detector::detect (the Suppress verdict). Live result: turns 4→9-15, bash 2→5-7, 'Spinning detection fired' → 0, builds now succeed, and Atlas landed its FIRST webserver_ok=true on the opencode harness (real axum server, /ping→pong on 0.0.0.0:3001). Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
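The before/after behaviour of the spinning detector can be contrasted with a small Python sketch. The thresholds (content < 500, tool_args < 100, 5 consecutive turns) come from the commit message; the turn representation and function names are assumptions, and a turn is treated as having issued a non-empty tool call whenever its tool-argument length is above zero.

```python
from typing import Callable, List, Tuple

Turn = Tuple[int, int]  # (content_len, tool_args_len) for one assistant turn


def old_is_short(turn: Turn) -> bool:
    """Original rule: short content AND short tool args counts as 'short'."""
    content_len, tool_args_len = turn
    return content_len < 500 and tool_args_len < 100


def new_is_short(turn: Turn) -> bool:
    """Fixed rule: only pure-text turns (no tool call at all) can be 'short'."""
    content_len, tool_args_len = turn
    return tool_args_len == 0 and content_len < 500


def spinning(turns: List[Turn], is_short: Callable[[Turn], bool], limit: int = 5) -> bool:
    """True when the last `limit` turns are all short (consecutive run at the tail)."""
    run = 0
    for turn in reversed(turns):
        if not is_short(turn):
            break
        run += 1
    return run >= limit


# A verify cycle: five small tool calls (cargo build, curl, small edit, ...).
verify_loop = [(40, 30)] * 5
# Five short narration turns with no tool call.
pure_text = [(120, 0)] * 5
```

With the old rule the verify loop trips the detector; with the new rule it does not, while a run of short pure-text turns still does.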

…under-covers a full-prompt match #226 warm-hit SSM corruption: when an intermediate checkpoint matched at full prompt length (ssm_snapshot_tokens < matched == total — e.g. the leaf snapshot was LRU-evicted from the 16-slot pool, leaving only a block-aligned checkpoint), prefix_lookup.rs forced skip_tokens=matched, restoring SSM state to snap_tok but advancing positions/KV to total → stale recurrent h_state/conv_state by (total-snap_tok) tokens → first decoded token reads misaligned state → healthy logits but wrong → immediate stop / empty completion. Clusters in later N=10 runs (warm cross-request cache). Fix: skip only to snap_tok so suffix-prefill replays SSM over the gap (mirrors the matched<total intermediate path); exact-leaf hit (snap_tok==matched==total) still routes through the 25f8bbe fixup. Same guard added to the EP=2 two-phase path (prefill_c.rs). Live N=10 (cache ON): give-ups 4→1, ws_ok 2→3/10. PARTIAL — a residual warm-hit path remains (cache-OFF still best at 6/10). Correct direction (fix corruption vs disable caching); residual is a known follow-up. run_tier.sh: OC_TIMEOUT env knob (default 360) for de-confounding runs. Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…pro tooling + bench evidence prefix_lookup.rs: env-gated (default OFF) bypass of the exact-leaf snapshot shortcut — routes exact full-prompt hits through full recompute. Used to rule out the marconi_exact_snap fixup as the cache-ON regression cause (bypass still 2/10). Inert in production. Adds prefix_partial_hit_repro.py + N=10 run JSONs from the prefix-cache investigation (spinfix/pfxfix/pfxfix2/exactprobe/cacheoff2/bf16* tiers). Findings (memory project_gaps_to_10of10): single-forward partial-hit is bit-exact (verified, identical-prompt cold-vs-warm); harness cache-ON regression (23% vs cache-OFF 65%, z=3.2) is real but lives in the multi-turn/eviction/batched regime — next: instrumented multi-turn repro. Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…og hardening

First clean Atlas-FP8 webserver_ok (atlas_capfix run2: build+bind+ping=pong, 9 turns, 309s). Snapshot of the fix/in-think-tool-call-leak working tree at the capfix milestone (the exact state atlas-gb10:capfix was built from).

- F1: unconditional per-generation post-think content-token cap (max_post_think_content_tokens, default 100_000 no-op; qwen3.6 MODEL.toml=1536), plumbed atlas-kernels -> WatchdogParams; caps in emit_step.rs + decode_logits_content.rs
- F4: sticky tool_request bool so the inter-tool prose-budget survives grammar disengage
- F5: max_inter_tool_prose 2048 -> 1024 (qwen3.6 MODEL.toml only)
- F2: bounded value-rest rest_part{0,6000} behind ATLAS_GRAMMAR_VALUE_HARDEN (default OFF)
- harness: aggregate.py/run_tier.sh exit code = total cargo+webserver failure count

Revert target (pre-session committed base): 7e8e2d6

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… serialization parity

ROOT CAUSE of the opencode webserver_ok gap (Atlas-FP8 ~0/10 vs vLLM-FP8 10/10 on IDENTICAL FP8 weights): repetition_penalty=1.1 in MODEL.toml [sampling.*]. At the harness temp=0, the frequency penalty (256-token window) suppresses the most-recurrent code tokens (newline id198, `::` id476) and flips low-margin greedy picks -> collapsed newlines / `::`-splitting / prose-wander / runaway-to-length. vLLM runs penalty-free greedy so never flips -- which is why the *more precise* engine was the failing one. NOT a precision floor: Atlas-vs-BF16 cosine 0.99523 > vLLM 0.98482 at 40/40 layers.

- kernels/gb10/qwen3.6-35b-a3b/MODEL.toml: repetition_penalty 1.1->1.0 in all four [sampling.*] presets; dry_multiplier=0.5 KEPT (DRY catches the verbatim filename runaway that prompted the prior revert; rep_penalty is the code-corruptor). Note the selector (sampling_setup.rs:53) uses tools|thinking_text|non_thinking; thinking_coding is currently dead but set to 1.0 too.
- tool-JSON: serde_json + minijinja `preserve_order` + a custom spaced `tojson` filter so the <tools> block byte-matches transformers/vLLM (Cargo.toml, spark-server/Cargo.toml, tokenizer.rs, jinja_helpers.rs).

Result: norep2 N=10 = 8/10 webserver_ok, 9/10 cargo_valid (was 0/39 across all configs this session). Also re-enables the fast GPU-greedy decode path (gated off when repetition_penalty != 1.0).

Also carries prior env-gated additions (default-off, inert unless enabled): ATLAS_LOOP_NO_SUPPRESS (loop_detect.rs), ATLAS_TOOL_SHORT_TRIGGER (compile_tools.rs), qwen3_coder leak-markers (qwen3_coder.rs).

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
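How a 1.1 penalty flips a low-margin greedy pick can be shown with two logits. The Python sketch below assumes the common multiplicative form of repetition penalty (positive logits divided by the penalty, negative ones multiplied) applied to tokens seen in the last 256 positions; the commit does not spell out Atlas's exact formula, and the runner-up token id and logit values are invented.

```python
def apply_repetition_penalty(logits, history, penalty, window=256):
    """Penalise every token that appears in the last `window` generated tokens."""
    out = dict(logits)
    for tok in set(history[-window:]):
        if tok in out:
            # Assumed multiplicative form: shrink positive logits, push negatives down.
            out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def greedy(logits):
    """temp=0 decoding: pick the highest logit."""
    return max(logits, key=logits.get)


NEWLINE = 198          # newline token id, per the commit message
RUNNER_UP = 12345      # hypothetical token id for the second-best candidate

logits = {NEWLINE: 10.0, RUNNER_UP: 9.5}   # newline wins by a small margin
history = [NEWLINE, 476, NEWLINE, 476]     # code text: newline and `::` recur often

pick_no_penalty = greedy(apply_repetition_penalty(logits, history, 1.0))
pick_with_penalty = greedy(apply_repetition_penalty(logits, history, 1.1))
```

With penalty 1.0 the newline is kept; with 1.1 its logit drops to about 9.09, below the runner-up's 9.5, so greedy decoding emits the wrong token. A 5% margin is all it takes, which is why frequently repeated code tokens were the ones corrupted.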

…ids, unify logit paths, fix >=-merge grammar boundary

Drives Qwen3.6-35B-A3B-FP8 opencode webserver_ok from 8/10 to a clean 10/10 with MTP on (BF16-KV, N=10), and proves the long-standing gap was never an "FP8 precision floor" but a stack of Atlas-specific band-aids plus a tokenizer grammar-boundary bug.

Grammar (compile_tools.rs):

- value ::= leading_ws* first_content rest, so a leading newline (the model's genuine top-1 at content-start) is no longer masked into a lean/cargo drift runner-up; keeps the non-empty guard.
- Exclude =/> from first_content. The param key closes with '>' and the tokenizer fuses it into >X merge tokens (>= is id 9628); at the <parameter=KEY>->value boundary the model could emit >=, depositing a phantom '=' as the value's first char and breaking edit oldString matches (the "give-up": 3 failed edits -> watchdog kill of a coherent agent). Excluding =/> makes xgrammar reject the merge token at the boundary.

Remove Atlas-only band-aids (no analogue in vLLM/SGLang/TRT-LLM):

- AM1 lean-attractor logit bias + pos-0 close-tag bias + tier1 fast-path bypass + the attractor_mask module.
- Whole tool-call repair/salvage stack: toml_repair (SC1), fuzzy_repair (A2/SymSpell), tool_retry (Tier5c), tool_salvage (narrate->tool), repair_json, and the validation.rs ATLAS_WRITE_PATH_RECOVERY block.
- Whitespace mask (WS1/WS2).
- Dead env vars: ATLAS_DISABLE_FUZZY_REPAIR, ATLAS_WRITE_PATH_RECOVERY, ATLAS_TOOL_RETRY.

Unify the two per-token logit paths (SSOT): the non-MTP decode path was a stale inline copy of run_pipeline; both now route through process_position_logits. Fixes A4 being silently dead on the MTP verify path; B1 margin detector moved to b1_margin.rs (FinalDecode-gated). Three guard tests make the "logit rule on one path only" regression impossible.

MODEL.toml: dry_multiplier 0.5->0.0 (unlocks the on-GPU fast-greedy path, ~30->48 tok/s on BF16-KV; rep_penalty already 1.0).

SSM slot-leak-on-abort fix (RAII SlotGuard, ssm_pool.rs) so no-MTP runs do not brick.

Result: MTP-on BF16-KV N=10 = 10/10 webserver_ok, 0 phantom-= corruptions, all drift markers 0, 49 tok/s. spark-server 459 tests + spark-model 69 green.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
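The `>=` boundary problem can be modelled without a grammar engine. The Python sketch below is a toy stand-in for what xgrammar does with the compiled rule: the function name and the character-level check are invented, and it only models the single position where the grammar still expects the `>` that closes `<parameter=KEY>`.

```python
EXCLUDED_FIRST = frozenset({"=", ">"})   # characters barred from first_content
LEADING_WS = " \t\r\n"                   # leading_ws* may include newlines


def token_allowed_at_boundary(token_text: str, excluded=EXCLUDED_FIRST) -> bool:
    """Decide whether a (possibly merged) token may be emitted at the key/value boundary.

    The token must first supply the '>' that closes the parameter key. Whatever
    follows inside the same token is the start of the value, so after optional
    leading whitespace its first character must be valid first_content.
    """
    if not token_text.startswith(">"):
        return False                     # the key is still open; '>' must come first
    rest = token_text[1:].lstrip(LEADING_WS)
    if not rest:
        return True                      # token only closed the key (plus whitespace)
    return rest[0] not in excluded
```

With the exclusion in place the merged token `>=` is rejected while `>`, `>\n`, and `>fn` remain legal; passing an empty `excluded` set reproduces the old behaviour in which `>=` slipped through and left a phantom `=` at the start of the value.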

…lit-K negative result ATLAS_DECODE_TIMING=1 (gated, zero-cost off): per-token host-path stage timers in process_decode_logits splitting copy+forward-wait vs the 248k-vocab host sampling loops. Measured: host sampling is ~0.68ms/tok (negligible); the per-token decode cost is the GPU forward. split_ref_seqs: reverted a single-stream split-K unpin experiment — a clean A/B (same 21.8k code task) was byte-identical (12.7 tok/s both), confirming attention occupancy is NOT the long-ctx decode bottleneck. Determinism pin kept intact; negative result documented in-code. These localize the real bottleneck (separate kernel work): prefill is ~20x slower than vLLM, owned by the naive W8A16 projection GEMM (38%) + MoE FFN GEMM (40%) at 1.4-7 TFLOP/s, NOT the GDN scan (11%). Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>

tbraun96 and others added 3 commits May 22, 2026 21:12

tbraun96 requested a review from AzeezIsh as a code owner May 23, 2026 13:45

tbraun96 and others added 17 commits May 24, 2026 13:20

github-code-quality Bot found potential problems May 25, 2026


TheTom mentioned this pull request May 25, 2026

feat(tq+): TurboQuant+ KV cache — canonical Hadamard + 9 asymmetric variants + Turbo2 + dispatch tests #92

Open

9 tasks

tbraun96 and others added 5 commits May 25, 2026 16:05

tbraun96 mentioned this pull request May 28, 2026

[bug] #100

Open

tbraun96 and others added 9 commits May 28, 2026 15:58

github-code-quality Bot found potential problems May 31, 2026


Comment thread bench/fp8_dgx2_drift/c1_final_logit_overlap.py

    def load_config_rms_eps() -> float:
        cfg = json.load(open(SNAP / "config.json"))

Comment thread bench/fp8_dgx2_drift/c1_final_logit_overlap.py

    if __name__ == "__main__":
        sys.exit(main())

tbraun96 and others added 10 commits May 31, 2026 15:12

tbraun96 changed the title ~~fix(streaming): detect & cancel in-think <tool_call> leak (Qwen3.6 + opencode)~~ fix(coherence): Debugging to Get Qwen Working In Agentic Coding Jun 1, 2026

tbraun96 mentioned this pull request Jun 3, 2026

[bug] Results are not the same quality as vLLM #40

Closed

tbraun96 and others added 3 commits June 3, 2026 10:20


fix(coherence): Debugging to Get Qwen Working In Agentic Coding #90

tbraun96 wants to merge 50 commits into main from fix/in-think-tool-call-leak

tbraun96 commented May 23, 2026


Fix — two layers

Layer D (hotfix, 1 line)

Layer A (real fix, ~75 LoC in 2 files)

Layer B (deferred)

Verification
